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Abstract 

This  thesis  develops  a  framework  for  secure  distributed  time,  and  uses  this  framework  to  build 
secure  protocols  for  practical  problems.  In  distributed  systems,  many  important  problems — such  as 
detecting  potential  causality,  obtaining  global  states,  and  recovering  from  process  failure — center 
on  temporal  relations  more  general  than  the  linear  order  of  real  time.  Systems  with  asynchronous 
message  passing  require  a  partial  order  time  model,  and  systems  with  multiple  levels  of  abstraction 
require  multiple  levels  of  time  models.  Building  clock  primitives  for  these  time  models  facilitates 
building  protocols  for  these  application  problems.  However,  protocols  built  (even  tacitly)  on 
such  clocks  open  themselves  to  security  and  privacy  risks,  since  tracking  these  temporal  relations 
requires  sharing  and  trusting  private  information. 

This  thesis  addresses  these  issues  of  time  and  security  by  constructing  a  distributed  time  formalism 
that  supports  hierarchies  of  general  time  models,  md  then  constructing  clock  primitives — the 
Signed  Vector  Timestamp  protocol  and  the  Sealed  Vector  Timestamp  protocol — that  provide  security 
and  privacy.  Framing  application  problems  in  terms  of  this  distributed  time  framework  grants 
insight  that  often  allows  us  to  build  protocols  more  general  and  flexible  than  were  previously 
possible.  Separating  clocks  from  protocols  grants  additional  flexibility  by  allowing  us  to  keep  their 
design  issues  mutually  transparent. 

This  thesis  explores  three  applications  of  this  secure  distributed  time  framework.  We  transparently 
add  security  and  privacy  to  immediate  ordered  service  protocols.  We  build  basic  distributed 
snapshot  protocols  and  transparently  add  security,  privacy,  and  increased  flexibility.  Finally,  we 
use  the  framework  to  build  a  new  optimistic  rollback  recovery  protocol  that  substantially  improves 
on  previous  work  by  providing  full  asynchrony  while  also  reducing  the  worst-case  bound  for 
rollbacks  after  a  failure  from  exponential  to  one  per  process;  further,  developing  this  protocol 
within  the  distributed  time  framework  transparently  allows  for  security  and  privacy. 
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Chapter  1 
Introduction 


Many  problems  in  distributed  systems  center  on  temporal  relations  more  general  than  the  linear 
order  of  real  time,  or  even  on  a  single  layer  of  a  more  general  order.  Put  simply,  distributed  systems 
need  distributed  time.  Recognizing  the  central  role  that  distributed  time  plays  creates  opportunities; 

•  Using  the  appropriate  temporal  relations  can  allow  deeper  insight  into  the  nature  of  applica¬ 
tion  problems. 

•  Providing  clock  primitives  for  these  temporal  relations  permits  construction  of  clearer,  more 
flexible,  and  more  general  protocols  for  these  problems. 

However,  recognizing  the  central  role  that  distributed  time  plays  also  leads  to  recognizing  a  signif¬ 
icant  problem: 

•  Protocols  built  (even  implicitly)  on  distributed  time  are  open  to  security  and  privacy  risks, 
since  tracking  these  temporal  relations  requires  sharing  private  information,  and  requires 
trusting  the  information  that  is  shared.  Attacks  on  the  lower-level  clocks  lead  to  attacks  on 
the  higher-level  protocols. 

This  thesis  identifies  and  resolves  these  issues  by  building  a  framework  for  secure  distributed 
time,  and  by  using  this  framework  to  build  secure  distributed  protocols. 


1.1.  Distributed  Time 


Our  first  intuitions  organize  experience  into  a  linear  sequence  of  discrete  events.  However,  this 
approach  is  inappropriate  for  asynchronous  distributed  systems,  where  information  is  distributed 
and  perception  is  delayed.  Distributed  environments  require  a  distributed  notion  of  time,  to  abstract 
away  not  only  irrelevant  physical  detail  but  also  irrelevant  temporal  and  computational  detail.  By 
better  expressing  distributed  systems  concepts  that  are  difficult  to  talk  about  in  terms  of  real  time, 
and  by  distinguishing  what  “actually  happens”  from  what  physically  occurs,  a  theory  of  distributed 
time  provides  a  natural  framework  for  solving  problems  in  distributed  environments. 


Chapter  2  lays  the  groundwork  for  these  tasks  by  reviewing  the  theory  of  distributed  time  we 
developed  for  this  thesis.  This  theory  improves  on  previous  work  on  time  in  distributed  systems  by 
supporting  temporal  relations  more  general  than  partial  orders,  by  supporting  abstraction  through 
multiple  levels  of  temporal  relations,  by  separating  the  family  of  temporal  relations  an  application 
consults  from  the  particular  clock  implementations  that  track  them,  and  by  providing  a  single  arena 
in  which  to  consider  these  issues  for  a  wide  range  of  applications. 

One  central  claim  of  this  thesis  is  that  distributed  time  provides  a  framework  for  building 
general  protocols  for  distributed  systems  application  problems.  We  can  first  phrase  problems  in 
terms  of  distributed  time,  and  then  phrase  protocols  in  terms  of  distributed  time  clock  primitives. 
Chapter  2  through  Chapter  4  develop  this  claim  by  considering  several  application  problems: 

•  Potential  Causality  Determining  whether  one  event  could  potentially  have  influenced 
another  requires  sorting  events  in  the  partial  order  determined  by  the  asynchronous  compu¬ 
tation,  rather  than  in  the  linear  order  determined  by  real  time.  Clocks  for  partial  order  time 
directly  support  building  protocols  for  problems  such  as  orphan  detection  and  immediate 
ordered  service  that  reduce  to  sorting  based  on  potential  causality. 

•  Snapshots  and  Global  States  Distribution  and  asynchrony  make  it  difficult  for  a 
process  to  determine  the  state  of  the  system  at  any  given  instant,  since  anything  that  the 
process  can  perceive  about  other  processes  will  be  out-of-date.  However,  phrasing  snapshots 
as  timeslices  from  a  time  model  provides  a  way  to  use  clocks  for  these  models  to  capture 
general  snapshots  and  to  reason  about  global  states.  Phrasing  the  problem  this  way  allows 
us  to  extend  a  basic  protocol  by  substituting  clocks  for  more  general  temporal  relations,  and 
to  address  performance  concerns  by  substituting  clock  implementations. 

•  Optimistic  Rollback  Recovery  The  problem  of  rollback  recovery  arises  whe .  a  process, 
due  to  some  type  of  failure,  must  roll  back  events  and  restart  execution  (possibly  with 
modified  replay).  Recovery  is  optimistic  when  other  processes  may  depend  on  the  lost 
events  at  the  failed  process.  Since  optimistic  recovery  requires  tracking  dependency,  many 
previous  approaches  use  some  form  of  partial  order  clocks,  and  thus  already  dovetail  nicely 
with  our  work.  However,  effectively  performing  this  recovery  asynchronously  requires 
tracking  potential  knowledge  of  failures  as  well.  This  potential  knowledge  relation  is  also 
expressed  by  a  partial  order  time  model — but  a  lower-level  model  than  the  dependency 
model.  The  distributed  time  framework  provides  the  tools  needed  to  clearly  talk  about  such 
hierarchies  of  time — ^and  thus  to  develop  new  rollback  protocols  that  improve  on  previous 
work. 

The  distributed  time  framework  introduces  orthogonality  between  clocks  and  the  higher-level 
protocols  that  use  them.  Besides  permitting  more  flexible  protocols,  this  orthogonality  has  an 
additional  benefit:  we  can  consider  clock  issues  on  the  clock  level,  independently  of  the  protocol 
issues.  This  approach  offers  advantages: 

•  Orthogonality  between  Time  Models  and  Protocols  Separating  clocks  from  proto¬ 
cols  provides  a  separation  between  time  models  and  protocols.  We  can  transparently  change 
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the  scope  of  a  protocol  by  substituting  clocks  for  a  different  time  model — for  example, 
a  change  to  the  model  underlying  the  snapshot  protocols  allows  us  to  capture  snapshots 
satisfying  the  property  of  having  no  messages  in  transit. 

•  Unification  of  Protocois  The  distributed  time  framework  unites  protocols  that  individ¬ 
ually  deal  with  distributed  time  issues.  This  unification  directly  allows  tasks  such  as  taking 
an  offline  snapshot  after  rollback  with  modified  replay.  For  example,  rollback  with  modified 
replay  creates  three  distinct  versions  of  the  computation:  the  failed  computation,  the  virtual 
failure-free  computation,  and  the  underlying  failure-plus-recovery  computation.  For  each  of 
these  computations,  scenarios  exist  where  a  snapshot  would  be  useful.  The  distributed  time 
framework  directly  supports  this  flexibility. 


1 .2.  Security  and  Privacy 

In  a  distributed  system,  a  process  can  detect  the  local  passage  of  real  time  by  examining  an 
independent  physical  device,  such  as  a  quartz  clock.  However,  to  track  more  general  temporal 
relations,  a  process  must  collect  and  share  private  information.  Consequently,  dealing  with  these 
relations — even  implicitly — exposes  protocols  to  security  and  privacy  risks: 


•  Is  the  information  a  process  receives  correct? 

•  Is  the  information  a  process  shares  being  used  for  dishonest  purposes? 


This  sharing  and  trusting  creates  opportunities  for  Byzantine  (malicious)  processes  to  manipulate 
the  clock  protocols,  and  consequently  to  manipulate  application  protocols  built  on  these  clock 
protocols.  The  orthogonality  that  distributed  time  introduces  between  clocks  and  protocols  thus 
has  the  additional  significant  benefit  of  creating  a  single  arena  in  which  to  examine  and  resolve  these 
security  issues.  Installing  clocks  that  protect  against  security  and  privacy  attacks  will  transparently 
provide  this  protection  to  higher-level  protocols. 

The  latter  part  of  this  thesis  examines  these  security  and  privacy  aspects  of  distributed  time. 
Chapter  5  begins  this  examination  by  considering  secure  clocks.  For  example,  the  standard  time- 
stamp  vector  mechanism  for  partial  order  time  permits  numerous  attacks.  We  catalog  these  attacks, 
and  present  two  protocols  that  provide  protection:  the  Signed  Vector  Timestamp  protocol  and  the 
and  Sealed  Vector  Timestamp  protocol.  We  discuss  scalability  and  implementation  issues,  and 
outline  avenues  for  further  research  into  secure  clocks. 

Chapter  6  then  uses  these  techniques  to  add  security  and  privacy  protection  to  the  distributed 
protocols  developed  in  Chapters  2  through  4. 
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1.3.  Overview  of  Previous  Work 


Previous  work  has  explored  partial  order  time  for  distributed  systems,  both  through  mathematical 
models  and  through  protocol  construction.  The  mathematical  work  provided  foundational  insights 
but  did  not  support  construction  of  clocks  and  protocols;  the  protocol  work  did  not  provide  a  fully 
general  framework,  and  consequently  did  not  exploit  the  full  power  of  the  temporal  abstraction 
being  performed.  Further,  the  security  challenges  raised  by  using  clocks  for  relations  more  general 
than  linear  time  were  unidentified  and  unsolved. 


Time  The  notion  that  the  linear  order  of  real  time  may  be  inappropriate  for  asynchronous  dis¬ 
tributed  systems  emerges  in  earlier  work.  Jefferson  [Je85]  used  linear  time  that  departs  from  the  real 
time  order.  Lamport  [La78]  used  partial  orders  to  track  causal  dependency  in  distributed  systems. 
Pratt  [Pr86]  argued  for  the  universality  of  partial  order  time.  Partial  order  temporal  relations  have 
also  emerged  in  the  areas  of  semantics  (e.g.,  [Gr75,  Pe80,  GaPr87,  CCMP89,  Win89])  and  artificial 
intelligence  (e.g.,  [6a93,  Bo93,  Ts87,  YaA193I).  Using  partial  orders  for  distributed  systems  is 
sometimes  called  logical  time',  Fidge  [Fi91]  presents  a  good  survey  paper,  and  very  recently  Yang 
and  Marsland  [YaMa93,  YaMa941  have  published  a  collection  of  some  of  the  principal  papers  on 
these  issues  (and  the  orthogonal  issues  of  total  order  clock  synchronization). 


Asynchrony  Previous  work  [BiJo87,  PBS89,  SES89]  has  also  explored  the  communication 
problems  introduced  by  asynchrony:  by  the  fact  that  the  underlying  temporal  structure  is  not  the 
linear  order  of  real  time.  One  proposed  solution  to  this  problem  is  to  fix  a  partial  order  structure  as 
the  causal  order  and  to  enforce  (via  multicasting)  that  processes  perceive  a  consistent  view  of  this 
order.  (The  appropriateness  and  scalability  of  this  solution  has  lately  generated  no  small  amount  of 
controversy  [ChSk93,  Bi94,  Co94,  Re94].)  Other  approaches  to  this  problem  include  frameworks 
to  adapt  protocols  for  the  asynchronous  partial  order  environment  after  developing  them  in  simpler 
environments  [Aw85,  Mo85,  NeTo90] . 


Protocols  Partial  order  time  has  also  appeared  in  various  forms  in  distributed  systems  appli¬ 
cations.  Some  of  these  areas  include  distributed  debugging  [Fi89,  Sp89I,  distributed  snapshots 
[ChLa85,  AhKs89,  Ma93]  and  the  use  of  distributed  snapshots  in  debugging  [CoMa91,  MaNe91, 
MaSa91,  GaWa94].  Partial  order  time  has  also  been  used  in  deadlock  detection  [Ma87,  TaLo91  ], 
immediate  ordered  service  [KeKo89],  and  rollback  recovery  [StYe85,  Jo89,  JoZw90,  ElZw92, 
PeKe93]. 


Clocks  Lamport  [La78]  proposed  a  clock  mechanism  that  allows  processes  in  an  asynchronous 
distributed  system  to  track  a  total  order  consistent  with  the  underlying  partial  order.  Fidge  [Fi881 
and  Mattem  [Ma89]  formally  explored  partial  order  time  and  concurrently  introduced  the  vector 
timestamp  mechanism.  (Protocols  essentially  identical  to  the  vector  timestamps  mechanism  also 
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independently  appeared  in  other  work  [StYe85,  KeKo89]).  Other  research  has  explored  opti¬ 
mizations  to  the  vector  clock  protocol  [SiKs90],  trading  decreased  accuracy  for  decreased  size 
[ACGS911,  and  limits  to  vector  size  [CB91  ]  and  to  clock  accuracy  [Va93]. 


Security  The  author  [Sm9 1  ]  identified  security  problems  in  the  vector  clock  and  Lamport  clock 
mechanisms,  and  introduced  the  Signed  Vector  Timestamp  protocol.  Reiter  and  Gong  [ReGo93] 
also  explored  this  area  and  independently  discovered  this  protocol.  Amman  and  Jajordia  [AmJa93] 
explored  some  issues  in  securely  generating  timestamps  in  the  face  of  confinement  levels. 


1 .4.  Thesis  Contributions 


This  thesis  uses  a  framework  for  general  temporal  relations  to  advance  the  state  of  the  art  both  in 
distributed  protocol  design  and  also  in  security  and  privacy  for  distributed  systems. 


Time  To  begin  with,  this  thesis  provides  a  fully-developed  formalism  to  talk  about  clocks  for 
temporal  relations  that  differ  from  the  linear  order  of  real  time.  This  formalism  improves  on  the 
foundational  work  by  allowing  us  to  talk  about  arbitrary  relations  (not  just  partial  orders,  and  not 
just  the  single  partial  order  of  information  flow)  and  hierarchies  of  abstraction  (not  just  a  single 
level),  and  allowing  us  to  build  clocks  for  these  relations  and  protocols  based  on  these  clocks. 


Protocols  This  thesis  then  applies  this  framework  to  the  example  problems  of  distributed  snap¬ 
shots  and  optimistic  rollback  recovery.  We  can  define  global  states  in  terms  of  distributed  time 
relations,  and  build  snapshot  protocols  in  terms  of  clock  queries;  this  approach  allows  us  to  substi¬ 
tute  clock  implementations  (e.g.,  for  increased  security)  and  to  substitute  underlying  time  models 
(e.g.,  to  capture  specialized  properties  or  to  examine  alternate  virtual  computations).  The  ability  to 
talk  about  multiple  levels  of  time  allows  us  to  build  an  optimistic  rollback  recovery  protocol  that 
provides  fully  asynchronous  recovery  while  also  reducing  the  worst  case  number  of  rollbacks  after 
a  failure  from  exponential  (as  in  Strom  and  Yemini’s  asynchronous  protocol  [StYe8S])  to  at  most 
one  per  process.  Further,  the  single  framework  of  distributed  time  allows  us  to  consider  in  one 
place  problems  and  protocols  separately  affecting  time  abstraction  . 


Secure  Clocks  This  research  was  the  first  to  identify  security  and  privacy  problems  inherent  in 
partial  order  time.  This  thesis  presents  both  the  first  secure  partial  order  clock  protocol,  as  well  as 
the  most  secure  clock  protocol  to  date.  This  latter  protocol  provides  security  and  privacy  despite 
any  number  of  corrupt  agents-and  extends  to  partial  order  temporal  structures  that  differ  from  the 
underlying  partial  order  of  information  flow. 
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SdCure  Protocols  This  thesis  demonstrates  a  systematic  and  transparent  way  to  add  secu¬ 
rity  and  privacy  protection  to  protocols  developed  within  this  secure  distributed  time  framework. 
Consequently,  this  work  shows  how  to  solve  application  problems  using  partial  order  time — while 
also  defending  against  espionage  and  Byzantine  attacks.  We  show  how  to  add  this  protection 
to  example  protocols  for  providing  immediate  ordered  service,  taking  distributed  snapshots,  and 
performing  optimistic  rollback  recovery. 

As  computer  systems  become  increasingly  distributed  and  user  applications  become  more 
attractive  to  attack,  the  issues  of  time  and  security  will  only  become  more  important.  This  thesis 
lays  the  groundwork  for  solving  these  problems. 

(A  glossary  follows  the  text  of  this  thesis.  This  glossary  presents  four  lists:  terms,  clock 
primitives,  time  models,  and  symbols.) 
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Chapter  2 

Distributed  Time 


This  chapter  reviews  the  theory  of  distributed  time,  a  general  framework  (developed  as  part  of 
this  thesis  research)  for  temporal  relations  in  distributed  systems.  Section  2. 1  presents  the  moti¬ 
vation  behind  the  theory.  Section  2.2  presents  tools  for  representing  and  abstracting  computation. 
Section  2.3  discusses  timeslices  and  global  states.  Section  2.4  specifies  and  builds  some  clock 
primitives.  Section  2.5  examines  some  example  applications.  Section  2.6  relates  this  chapter  to 
our  earlier,  more  detailed  publication  on  distributed  time  [Sm93]. 


2.1.  Overview 

Beyond  Real  Time  Normally  we  think  of  a  computation  as  a  sequence  of  states  and  events 
ordered  by  real  time.  However,  even  this  natural  view  performs  abstraction  from  full  physical 
detail  to  discrete  events.  Describing  asynchronous  distributed  computation  requires  extending  this 
abstraction  to  time:  if  two  events  occur  without  knowledge  of  each  other,  then  their  real  time 
sequence  does  not  matter  and  also  may  not  be  observable  [La78,  Pr86].  Consequently,  many 
application  problems  are  simplified  by  thinking  of  time  as  the  partial  order  determined  by  potential 
information  flow.  (This  relation  is  sometimes  referred  to  as  the  “Lamport  order,”  after  [La78],  and 
also  the  “causal  order,”  since  it  expresses  potential  causality.) 


Beyond  Partial  Order  Time  Pioneering  work  in  partial  order  time  [Fi88,  Ma89]  leaves  us 
thinking  about  computation  as  a  temporal  relation  on  a  set  of  objects — except  each  obJecLactually 
represents  the  activity  in  a  region  of  space-time,  and  the  relation  does  not  follow  directly  from  real 
time  order  on  these  regions.  Many  application  issues  suggest  that  we  should  continue  removing 
irrelevant  temporal  and  computational  detail — that  we  should  continue  the  process  of  abstraction; 


•  Using  multiple  levels  of  partial  order  time  clarifies  distributed  computations  that  fail  and 
recover. 

•  Omitting  the  details  of  recovery  facilitates  describing  the  failure-free  virtual  computation. 


•  The  temporal  relations  of  interest  may  not  necessarily  folio  from  the  information  flow 
partial  order.  One  example  is  the  partial  order  describing  the  virtual  computation  after  re¬ 
covery;  another  are  the  zigzag  paths  in  Xu  and  Netzer’s  recent  work  [XuNe93]  in  checkpoint 
coordination. 

•  The  temporal  relations  of  interest  may  not  necessarily  be  a  mathematical  order.  (As 
Section  2.2.3  discusses,  an  order  is  a  relation  both  transitive  and  antisymmetric.)  For 
example,  relaxing  the  transitivity  requirement  clarities  discussion  of  confinement  barriers 
and  individual  steps  in  information  flow.  Relaxing  the  acyclic  requirement  allows  a  natural 
way  to  unite  sets  of  events  into  atomic  units:  cycles. 


Extending  partial  order  time  to  a  general  framework  for  temporal  abstraction  provides  the  toots  to 
talk  about  these  scenarios.  Put  simply,  describing  a  distributed  computation  requires  a  theory  of 
distributed  time. 


Distributed  Time  for  Distributed  Protocols  A  theory  of  distributed  time  has  practical 
motivations  and  uses.  Consider  the  computation  performed  by  asynchronous  distributed  systems, 
with  processes  that  possess  no  common  clock,  that  fail  and  restart,  and  that  frequently  may  be 
disconnected  or  even  powered  down.  Many  application  problems  that  arise  in  these  systems 
reduce  to  asking  questions  about  temporal  relations  other  than  the  natural  real  time  sequence. 
Thinking  in  terms  of  these  alternative  temporal  relations  clarifies  these  problems;  providing  clocks 
for  these  relations  simplifies  protocol  design.  Indeed,  building  protocols  for  these  problems  requires 
confronting  these  clock  issues  in  one  form  or  another.  However,  exploiting  the  power  of  alternative 
temporal  relations  requires  understanding  the  underlying  framework.  The  remainder  of  this  chapter 
develops  these  formal  mechanisms  of  distributed  time. 

This  research  improves  on  earlier  work  by  providing  a  single,  general  theory  of  distributed 
time  suitable  for  a  wide  range  of  applications.  By  supporting  temporal  relations  more  general  than 
partial  orders  and  by  supporting  hierarchies  of  temporal  abstraction,  this  theory  can  express  the 
computational  abstraction  appropriate  ^or  families  of  application  problems.  By  providing  a  general 
approach  to  distributed  time,  this  theory  allows  us  to  unify  in  a  single  framework  protocols  that 
separately  consult  and  affect  time,  and  to  consider  once  the  clock  issues  centra]  to  each  separate 
protocol.  By  introducing  orthogonality  between  temporal  relations  and  the  clocks  that  track  them, 
this  theory  allows  us  to  consider  (and  alter)  clock  implementations  without  changing  higher-level 
protocols. 

Considering  these  goals  raises  some  critical  issues: 


•  We  want  to  represent  a  computation  as  some  abstract  set  of  “things  that  happened,”  with  a 
relation  indicating  the  temporal  order  in  which  these  things  happened. 

•  The  components  in  these  abstractions  themselves  represent  various  parts  of  a  literal  descrip¬ 
tion  of  what  physically  happened. 
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•  These  abstractions  should  permit  temporal  relations  more  general  than  that  of  linear  time. 

•  We  need  to  distinguish  between  the  way  we  obtain  the  abstract  representations  and  the 
representations  themselves,  since  we  may  have  multiple  routes  to  the  same  representation. 

•  We  want  to  be  able  to  ap^iy  abstractions  to  abstractions. 

We  conclude  that  a  general  theory  of  distributed  time  should  contain  three  components: 

•  a  standard  format  for  these  abstract  representations  (so  we  can  talk  about  computations); 

•  a  way  to  specify  time  models-,  representational  transformations  on  these  objects  (so  we  can 
abstract  from  one  representation  to  another);  and 

•  a  way  to  translate  some  level  of  physical  description  into  this  format  (so  our  chains  of 
abstraction  have  some  footing  in  reality). 

Distributed  time  models  provide  several  advantages: 

•  If  the  desired  physical  description  is  unavailable,  our  time  model  should  express  the  best 
observable  approximation. 

•  If  the  complete  physical  description  obscures  key  concepts,  then  our  time  model  should 
abstract  to  a  more  appropriate  description. 

•  If  the  processes  collectively  pretend  that  the  “current”  computation  differs  from  the  one 
a  complete  physical  description  would  record,  then  our  time  model  should  express  this 
abstraction. 


2.2.  Description  and  Abstraction 

2.2.1.  Systems 

In  the  theory  of  distributed  time,  we  model  the  system  as  a  collection  of  process  automata  that 
send  and  receive  messages  asynchronously  (and  unreliably).  Each  process  has  a  send  queue  and 
a  receive  queue,  not  necessarily  FIFO.  When  a  process  sends  a  message,  it  appends  the  message 
to  its  send  queue.  At  some  undetermined  time  later,  the  network  removes  the  message  from  the 
send  queue.  Eventually  the  message  may  appear  in  the  receive  queue  of  the  destination  process, 
which  may  then  receive  the  message.  (Thus,  each  message  may  be  received  at  most  once.)  We 
assume  that  each  message  is  sufficiently  distinct  (perhaps  using  identifier  tags)  so  that  we  can 
unambiguously  identify  the  send  corresponding  to  a  given  receive. 
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Each  process  also  has  a  state  transition  rule  8.  The  8  rule  may  differ  at  each  process  and  may 
even  be  nondeterministic  (specifying  a  set  of  possible  new  states  at  each  transition).  We  constrain 
the  8  rule  to  force  processes  to  behave  reasonably  with  respect  to  messages.  That  is,  during  a 
transition,  a  process  may  do  one  of  three  things: 


•  try  to  receive  a  message  from  its  input  queue, 

•  send  a  messt^e,  or 

•  perform  internal  computation. 


Sending  or  receiving  a  message  changes  the  internal  state  at  a  process.  A  process  may  receive 
messages  by  periodically  polling  its  queue,  or  by  continually  looping  on  a  poll  (thereby  blocking) 
until  a  message  is  received.  We  also  permit  interrupt-driven  receive  events:  a  process  may  attempt 
to  perform  an  internal  computation,  with  the  caveat  that  if  a  message  is  present  in  the  input  queue, 
the  process  will  receive  that  instead. 

A  process  operates  in  real  time  and  changes  state  at  indeterminate  intervals.  We  model  this 
behavior  by  saying  that  each  process  has  a  black  box  that  generates  ticks  nondeterministically  (but 
generating  only  finitely  many  in  any  finite  period  of  real  time).  When  receiving  a  tick,  a  process 
transforms  state  instantaneously  according  to  its  state  transition  rule.  If  a  tick  arrives  at  real  time 
u,  the  old  state  persists  for  times  t  satisfying  t  <u\  the  new  state  exists  for  t  >  u. 

A  system  computation  is  what  happens  when  the  processes  are  set  to  their  initial  conditions  and 
fed  with  nondeterministic  ticks. 


2.2.2.  Traces 

Probably  the  most  practical  ground-level  view  of  computation  is  a  linear  trace.  A  trace  is  an 
exhaustive  physical  description  analogous  to  a  movie  reel,  each  frame  stamped  with  a  real  time 
value  and  recording  the  states  of  each  process.  We  require  that  the  cameraman  obtaining  the  trace  to 
be  lucky  but  not  necessarily  regular:  the  interval  of  time  between  frames  need  not  be  constant,  but 
at  least  one  frame  must  be  taken  between  any  two  consecutive  ticks  (or  message  arrivals/departures) 
in  the  system.  Table  I  shows  an  example. 

So  that  traces  have  non-zero  duration,  we  require  that  they  have  at  least  two  frames. 

In  some  sense,  a  trace  is  a  hypothetical  construct,  since  obtaining  one  requires  access  to  the 
complete  physical  state  of  each  process  at  any  instant  in  real  time.  Nevertheless,  traces  serve  as  a 
starting  point,  describing  the  physical  action  in  a  computation. 
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tinte 

f  =  0.0 

t  =  0.8 

t  =  1.5 

t  =  2.2 

t  =  3.0 

t  =  4.2 

t  =  4.8 

p:  state 

initial 

17 

17 

17 

23 

23 

23 

p:  send  queue 

0 

{Af} 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

q:  state 

initial 

initial 

initial 

32 

32 

32 

12 

q:  send  queue 

0 

0 

0 

0 

0 

0 

0 

q:  receive  queue 

0 

0 

0 

0 

0 

[M] 

0 

Table  I  In  this  simple  example  of  a  system  trace, 
process  p  sends  a  message  M  to  process  q. 

2.2.3.  Computation  Graphs 

In  order  to  express  computation  as  a  temporal  relation  on  some  set  of  abstract,  discrete  objects, 
distributed  time  uses  a  computation  graph  format  where  nodes  represent  the  objects,  and  directed 
edges  represent  precedence.  This  construction  is  similar  to  ordered  multisets,  but  allows  us  to 
express  relations  more  general  than  orders,  and  to  use  language  already  in  the  common  parlance  of 
systems  scientists. 


Notation  An  atom  of  a  graph  is  a  node  or  an  edge.  A  minimal  node  in  a  computation  graph 
is  one  that  has  no  node  preceding  it:  a  node  with  in-degree  zero.  Similarly,  a  maximal  node  in  a 
computation  graph  is  one  that  has  no  node  following  it.  We  usually  use  lower-case  Greek  letters 
to  refer  to  computation  graphs,  upper-case  Roman  from  the  front  of  the  alphabet  to  refer  to  nodes, 
and  upper-case  Roman  from  the  end  of  the  alphabet  to  refer  to  sets  of  nodes. 


Precedence  and  Concurrence  For  nodes  A  and  5  in  a  computation  graph,  we  write  A  — ►  B 
to  indicate  that  node  A  precedes  node  B,  and  A  *-f-y  B  to  indicate  that  A  and  B  are  incomparable: 
neither  precedes  the  other.  We  say  that  incomparable  nodes  are  concurrent. 

We  write  A  :=±  B  to  indicate  that  either  A  — ^  B  or  A  =  B. 

The  precedence  relation  specified  by  a  computation  graph  is  an  order  when  it  satisfies  two 
conditions: 


•  The  relation  is  antisymmetric  (or  acyclic):  for  any  A,  B,  if  A  — ►  B  and  B  — >  A  then 
A  =  B. 

•  The  relation  is  transitive:  for  any  A,  B,  C,  if  A  — >  B  and  B  — ►  C  then  A  — ►  C. 
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We  use  graphs  to  permit  temporal  relations  more  general  than  orders.  In  particular,  defining 
precedence  by  edges,  rather  than  paths,  permits  nontransitive  relations 


Prefixes  and  Past-Closure  Suppose  a'  is  a  subgraph  of  computation  graph  a.  We  say  that 
a'  is  a  a  prefix  of  a  when  a'  is  connected  and  contains  all  minimal  nodes  of  a.  We  say  that  a'  is 
past-closed  when  common  nodes  have  the  same  history  in  a  and  a'.  (That  is,  for  any  node  B  in 
a',  if  node  A  precedes  B  in  a,  then  A  exists  and  precedes  B  in  a'.)  The  past-closure  of  a  subgraph 
a'  is  the  the  intersection  of  all  past-closed  subgraphs  of  a  that  contain  a'. 


Ground-level  Computation  Graphs  Directly  transiting  traces  into  computation  graph  for¬ 
mat  yields  ground-level  computation  graphs.  Ground-level  graphs  have  six  types  of  nodes;  a  photo 
node,  representing  the  state  of  a  process  captured  in  a  frame  of  the  trace,  and  nodes  representing 
each  way  that  process  state  might  transform: 

•  when  a  process  je/uir  a  message, 

•  when  a  process  receives  a  message, 

•  when  a  process  computes  something  internally  (i.e.,  a  state  transition  not  involving  input  or 
output), 

•  when  a  message  departs  from  the  send  queue  at  a  process,  or 

•  when  a  message  arrives  at  a  receive  queue  at  a  process. 

We  transform  a  trace  into  its  ground-level  graph  by  constructing  a  photo  node  for  each  process 
in  each  frame  of  the  trace.  Should  two  consecutive  photos  of  a  process  indicate  a  state  change, 
we  insert  the  appropriate  transition  node.  Directed  edges  connect  the  consecutive  nodes  at  each 
process. 

Figure  2. 1  shows  an  example  of  this  construction. 

Representation  Each  atom  in  a  ground-level  computation  graph  represents  some  part  of  the 
computational  space-time  expressed  by  the  trace.  The  space  coordinate  of  the  region  an  atom 
represents  is  determined  by  the  process;  the  process  p  atoms  represent  activity  at  process  p.  The 
time  span  is  determined  by  the  following  rules; 

•  Each  photo  node  represents  the  instant  in  time  of  that  frame. 

•  Each  transition  node  represents  the  unknown  instant  in  time  the  transition  occurred. 

•  Each  directed  edge  between  two  nodes  represents  the  open  interval  between  the  instants 
represented  by  the  endpoints. 
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Rgure  2.1  A  ground-level  computation  graph  is  the  lowest  level  abstraction  of  a 
computation.  This  sketch  shows  the  ground-level  graph  for  the  computation  whose 
trace  appears  in  Table  I. 


Figure  2.2  shows  an  example  of  this  representation. 


Event  vs.  State  Considering  how  to  build  computation  graphs  brings  up  an  important  question 
[Pr92]:  should  the  fundamental  object  (the  nodes)  represent  events  or  states?  Should  the  main 
unit  of  description  be  the  dynamic  “thing  that  happens”  at  a  process,  or  the  static  “interval  of 
holding  a  specified  bit-pattern?”  Ground-level  graphs  admit  both  types  of  objects:  photo  nodes 
describe  process  state,  while  the  other  nodes  describe  inferred  (rather  than  directly  observed)  state 
transitions. 

Each  approach  can  be  useful,  and  the  distributed  time  formalism  supports  both. 


2.2.4.  Time  Models 

Representative  lYansformations  A  ground-level  graph  provides  too  much  detail.  A  time 
model  is  a  mechanism  to  generate  more  abstract  descriptions.  Formally,  time  moctels  are  represen¬ 
tative  transformations  on  computation  graphs.  This  description  highlights  the  two  key  properties: 

•  Transformation  A  time  model  M  is  a  partial  function  on  computation  graphs.  Applying 
M  to  a  graph  a  (for  which  M  is  defined)  produces  a  new,  more  abstract  graph  M(o). 

•  Representation  If  model  M  is  defined  on  graph  q,  each  atom  of  M(o)  may  represent 
atoms  in  the  original  graph  q.  However,  this  representation  may  be  a  Chicago-style  democ¬ 
racy:  some  atoms  of  M(a)  may  represent  no  one,  and  some  atoms  of  a  may  have  multiple 
representatives.  We  formalize  this  arrangement  by  saying  that  the  application  of  M  to  o 
induces  a  representation  map  from  the  atoms  of  M{a)  to  sets  of  atoms  of  a.  This  map,  which 
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Figure  2.2  Each  atom  in  a  ground-level  computation  graph  represents  part  of 
the  space-time  region  in  which  the  computation  occurs.  This  diagram  shows  how 
the  process  p  part  of  the  ground-level  graph  from  Figure  2.1  partitions  the  time 
experience  of  process  p  in  the  computation  from  Table  I. 


we  denote  as  ( M,  a ),  takes  each  atom  in  the  new  graph  to  the  set  of  atoms  it  represents  in 
the  original  graph. 


Figure  2.3  shows  an  example  of  this  relationship. 


Composition  and  Hierarchies  The  functional  nature  of  time  models  allows  us  to  compose 
them.  This  allows  us  to  place  models — and  computation  graphs — into  hierarchies.  For  example. 
Ml  might  take  a  set  of  ground-level  computation  graphs  Qo  to  a  set  of  more  abstract  graphs  Q\. 
This  abstraction  might  lose  information,  in  the  sense  that  Mi  might  take  several  graphs  in  Qq  to  a 
single  graph  in  Q\.  Model  M2  may  abstract  from  ^1  to  the  composition  model  M2  0  M|  takes 
the  ground-level  graphs  directly  to  ^2- 

The  representation  map  for  composed  models  follows  naturally.  For  a  computation  graph  q, 
let  /3  =  M|(a)  and  7  =  M2(j5)  =  (M2  0  Mi)(o).  We  find  out  what  an  atom  A  in  7  represents 
under  the  representation  map  for  M2  o  M 1  by  the  following  steps; 

1 .  We  apply  the  map  for  M2  to  find  out  what  A  represents  in  /i. 

2.  We  then  apply  the  map  for  Mi  to  find  out  what  each  atom  in  this  set  represents  in  a. 

3.  We  take  the  union  of  the  results  of  this  second  round. 
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Figure  2.3  Time  model  M  transforms  computation  graph  a  to  computation  graph 
M{q).  The  representation  map  ( M,  a )  takes  each  atom  of  M(a)  back  to  the  set  of 
atoms  in  a  it  represents.  The  bold  arrow  indicates  the  action  of  M;  the  gray  arrows 
indicate  the  action  of  ( M,  a ). 

We  state  this  formally  in  the  equation: 

(M2oM,,a)(A)  =  U  {Mu  a)iB) 

Figure  2.4  illustrates  this  construction. 

Refinement  Suppose  two  models  Mi, M2  have  the  property  that  for  all  computation  graphs 
Q  and  a',  Mi(a)  =  Mi(a')  implies  M2(a)  =  M2(a').  Knowing  the  Mi  image  of  a  graph  is 
sufficient  to  determine  the  M2  image.  We  say  that  Mi  refines  to  M2,  and  write  Mi  >  Mi. 

When  Ml  >  M2,  model  M2  provides  a  more  abstract  view  of  the  underlying  computation,  but 
in  a  way  that  is  still  well-defined  in  terms  of  the  view  Mi  provides. 

Abstraction  Hierarchies  Refinement  is  clearly  transitive.  This  fact  allows  us  to  put  models 
into  abstraction  hierarchies:  chains  of  models  that  successively  refine  to  each  other. 

2.2.5.  Properties  of  Time  Modeis 

We  now  define  several  time  model  properties  that  we*  will  use  in  this  thesis.  Temporal  relations 
determine  two  of  these  properties: 

•  A  model  is  transitively  bounded  when  its  transitive  closure  has  a  unique  maximum  node  and 
a  unique  minimum  node. 

•  A  model  is  acyclic  when  its  transitive  closure  has  no  cycles.  (A  node  in  a  graph  is  acyclic 
when  it  does  not  precede  itself.) 
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Rgure  2.4  To  obtain  the  image  of  graph  a  under  the  composition  M2  o  M|,  we 
first  obtain  0  =  Mi(a),  and  then  obtain  7  =  To  apply  the  representation 

map  ( M2  o  Ml,  a )  to  an  atom  in  7,  we  first  apply  ( M2,  ^){A)  to  that  atom.  We 
then  apply  ( Mi,  a )  to  each  atom  in  the  resulting  subset  of  13,  and  take  the  union 
of  the  result.  In  this  diagram,  solid  arrows  indicate  the  action  of  the  time  models; 
gray  arrows  indicate  the  action  of  the  representation  maps. 
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The  remaining  properties  involve  relating  the  graphs  a  model  produces  to  the  computation 
underlying  this  graph  (through  the  trace  and  the  ground-level  computation  graph.)  First,  moving 
from  an  M  graph  back  to  the  static  underlying  computation  allows  us  to  define  two  properties; 

•  We  say  a  time  model  M  is  flow-supported  when  transitive  precedence  implies  information 

flow.  Every  precedence  path  is  supported  by  potential  information  flow.  That  is,  suppose  that 
A  and  5  are  two  nodes  in  a  graph  produced  by  M,  and  that  ground-level  graph  a  satisfies 
M(a)  =  fl.  If  A  *  B  in  then  an  information  flow  path  exists  from  the  space-time  region 

A  represents  (through  a)  to  the  space-time  region  that  B  represents  (through  q). 

•  We  say  a  time  model  M  is  flow-virtual  when  information  flow  does  not  necessarily  im¬ 
ply  precedence.  Such  a  model  may  express  the  information  flow  in  a  simulated  virtual 
computation. 

We  also  need  to  move  from  an  M  graph  back  to  the  dynamic  underlying  computation.  A  trace  of 
a  computation  corresponds  to  a  ground-level  graph.  A  computation  in  progress  induces  a  sequence 
of  increasing  finite  traces;  hence  we  can  think  of  an  unfolding  computation  as  the  sequence  of 
ground-level  graphs 

[o^i]  OO,  O'!, ...,  Q!,', ... 

corresponding  to  this  sequence  of  traces.  For  a  graph  0  produced  by  a  time  model  M,  we  can 
define  the  set  of  ground-level  graph  sequences  corresponding  to  unfolding  computations  that, 

at  some  point,  generate  0  through  M. 

^tA,0  =  {[o.]  :  3fcM(a;fe)=/3} 

We  use  this  set  to  define  several  types  of  monotonicity: 

•  A  model  M  is  node-monotonic  when,  for  any  graph  0  it  produces,  each  node  in  0  never 
vanishes  once  it  exists. 

V nodes  A  in  /?  V  [o.]  e  Sm,0  3  ^  Vy  :  A  e  M(a,)  U  >  k) 

•  A  model  M  is  weakly  edge-monotonic  when,  for  any  graph  0  it  produces,  each  edge  in  0 
never  vanishes  once  it  exists. 

Vedges  Ein0  V [a.]  €  Sm,0  3k  Vj  :  Ee  Miaj)  ^  {j  >  k) 

•  A  model  M  is  strongly  edge-monotonic  when  an  edge  exists  between  two  nodes  in  0  only  if 
it  always  exists  in  all  graphs  containing  those  two  nodes. 

V  nodes  A,B  \n0  V  [a^]  G  Su,0  V  j  : 

A,B€M(a;)  ((A — >  BinM(Qj))  {A — *  B  in  0)) 

•  A  model  is  weakly  monotonic  when  it  is  node-monotonic  and  weakly  edge-monotonic. 

•  A  model  is  strongly  monotonic  when  it  is  node-monotonic  and  strongly  edge-monotonic. 


2.2.6.  Parallel  Pairs 


Frequently  the  single  perspective  from  a  single  time  model  is  not  sufficient.  A  distributed  system 
provides  two  simple  examples: 

•  We  may  want  to  distinguish  between  “node  B  happened  immediately  after  node  A”  and 
“node  B  happened  after  node  A” 

•  We  may  want  to  distinguish  between  the  ordering  of  nodes  at  an  individual  process  (the  local 
computation)  and  the  system-wide  ordering  of  nodes  at  all  processes  (the  global  computa¬ 
tion). 

Distributed  time  allows  such  multiple  perspectives. 

Composition  allows  us  to  distinguish  between  the  basic  steps  in  a  computation — e.g.,  events 
A,  B,  C  happened  in  sequence — ^and  the  general  ordering.  We  simply  build  our  model  M  to  draw 
edges  for  the  basic  steps  and  build  a  standard  model  TRANS  to  take  the  transitive  closure  of  a  graph. 
Then  we  can  talk  about  basic  steps  using  M,  and  full  transitive  precedence  using  M  =  TRANS  o  M. 

Events/states  in  a  distributed  system  can  exist  and  be  ordered  on  two  levels:  locally,  in  the 
timelines  of  their  processes,  and  globally,  in  terms  of  the  entire  system.  A  graph  describing  the 
local  computations  clearly  relates  to  a  graph  describing  the  global  computation:  join  the  local 
graphs  “in  parallel,”  merge  some  events,  and  possibly  add  some  edges. 

A  parallel  pair  is  such  a  pair  of  models  (M,  M').  Both  models  act  on  ground-level  graphs. 
Model  M  produces  a  graph  describing  the  global  computation;  model  M'  produces  a  graph  com¬ 
posed  of  disjoint  straightline  graphs,  each  describing  a  local  process  timeline.  The  models  in  a 
parallel  pair  must  satisfy  one  additional  rule:  minimal  events  at  processes  must  correspond  to 
minimal  events  in  the  global  graph,  and  similarly  for  maximal  events.  When  M'  is  the  local  model 
in  a  parallel  pair,  we  denote  its  process  p  component  by  Xp  M'. 

The  two  models  in  a  parallel  pair  must  closely  correlate.  This  closeness  allows  us  to  define  a 
time  model  taking  graphs  produced  by  the  local  model  to  graphs  produced  by  the  global  one.  We 
call  this  model  the  factoring  model  M/M' .  The  factoring  model  satisfies  the  equation: 

M  =  (M/M')  o  M' 

Figure  2.5  illustrates  the  four  perspectives  that  a  parallel  pair  provides. 


2.2.7.  Properties  of  Parallel  Pairs 

Time  model  properties  directly  lead  to  several  parallel  pair  properties: 
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Rgure  2.5  A  time  model  generates  an  abstract  view  of  a  computation.  A  parallel 
pair  generates  four  views,  according  to  the  two  independent  choices:  whether  we 
use  the  process  timelines  or  the  overall  system  graph,  and  whether  we  consider 
basic  transitions  or  transitive  precedence.  Here,  the  parallel  pair  (M,  M')  acts  on 
ground-level  a,  the  computation  graph  corresponding  to  system  trace  T.  The  local 
model  M'  takes  a  to  ^  =  M'(a),  the  collection  of  process  timelines.  The  global 
model  M  takes  a  to  7  =  M(a),  the  overall  system  description.  We  can  take  the 
transitive  closure  of  either  of  these  graphs— and  of  either  of  these  models.  The 
graph  /3  =  M'(a)  egresses  the  full  transitive  relation  induced  by  the  basic  steps 
in  (3]  the  graph  7  =  M(a)  expresses  the  full  transitive  relation  induced  by  the  basic 
steps  in  7.  The  factoring  model  M/M'  takes  the  M'  image  to  the  M  image;  the 
factoring  model  M/M'  takes  the  M'  image  to  the  M  image. 
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•  A  parallel  pair  is  transitively  bounded  when  its  global  model  is  transitively  bounded. 

•  A  parallel  pair  is  acyclic  when  its  global  model  is  acyclic. 

•  A  parallel  pair  is  weakly  monotonic  when  the  transitive  closure  of  each  model  is  weakly 
monotonic. 

•  A  parallel  pair  is  strongly  monotonic  when  the  transitive  closure  of  each  model  is  strongly 
monotonic. 

•  A  parallel  pair  is  flow-supported  when  each  model  is  flow-supported. 

•  A  parallel  pair  \s  flow-virtual  when  the  transitive  closure  of  each  model  is  flow-virtual. 

The  pairing  of  time  models  leads  to  other  properties; 


•  Each  atom  at  a  process  affords  some  view  of  the  activity  at  the  other  processes.  Two  such 
atoms  at  a  process  are  externally  equivalent  when  they  afford  the  same  view:  either  both 
are  cyclic  or  both  are  acyclic,  and  both  have  the  same  transitive  global  relation  to  each  node 
at  all  other  processes.  A  graph  a  from  the  global  model  in  a  parallel  pair  is  view-complete 
when  any  edge  at  any  process  has,  in  a,  an  externally  equivalent  node  at  that  process.  That 
is,  if  any  basic  step  at  a  process  affords  some  external  view  in  the  transitive  global  graph,  a 
node  exists  at  that  process  giving  the  same  view.  A  parallel  pair  (M,  M')  is  view-complete 
when  all  graphs  produced  by  the  global  model  M  are  view-complete. 

•  A  consistent  parallel  pair  is  one  that  is  view-complete  and  transitively  bounded. 

•  In  an  independent  parallel  pair,  each  non-extremal  node  in  the  global  model  represents  a 
unique  node  in  the  process  model. 


lypes  of  Parallel  Pairs  This  thesis  will  focus  primarily  on  parallel  pairs  of  four  types: 

•  Type  1:  those  that  are  consistent; 

•  Type  2:  those  that  are  consistent  and  independent; 

•  Type  3\  those  that  are  strongly  monotonic,  consistent,  and  independent; 

•  Type  4:  those  that  are  flow-supported,  strongly  monotonic,  consistent,  and  independent. 

For  n  G  {1,2,3},  any  Type  n  pair  is  a  Type  n  -  1  pair,  but  some  Type  n  -  1  pairs  may  not 
necessarily  be  Type  n  pairs. 

We  will  also  consider  independently  when  a  parallel  pair  is  flow-virtual. 
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The  TVpe  1  and  Type  2  conditions  describe  the  internal  structure  of  time  models.  The  Type  3 
and  Type  4  conditions  describe  properties  useful  for  specifying  clocks  for  these  models.  The 
flow-virtual  condition  will  be  useful  in  considering  security  properties  of  these  clocks. 

When  a  parallel  pair  (M,  M')  is  Type  2,  we  will  informally  identify  the  nodes  in  the  M  graph 
with  their  mates  in  the  M'  graph. 


2.2.8.  Nonlinear  Pairs 

A  nonlinear  pair  is  a  pair  of  models  (M,  M')  that  meets  the  definition  of  parallel  pair,  except  for  the 
requirement  that  M'  produce  straightline  graphs.  The  definitions  of  Section  2.2.6  and  Section  2.2.7 
iq)ply  to  nonlinear  pairs  as  well. 


2.2.9.  Examples 

Thinking  about  time  in  asynchronous  distributed  computations  as  a  partial  order,  determined  by  the 
asynchrony  and  distribution,  holds  a  number  of  advantages  over  thinking  of  time  as  a  total  order, 
determined  by  real  time.  This  section  develops  time  models  to  transform  ground-level  computation 
graphs  to  graphs  depicting  their  natural  partial  order  time  descriptions. 

Partial  order  time  abstracts  away  irrelevant  temporal  detail.  As  we  shall  see  in  subsequent 
chapters,  frequently  we  need  to  abstract  away  irrelevant  computational  detail  as  well — deriving 
temporal  relations  more  general  than  the  standard  partial  order,  as  well  as  deriving  instances  of  the 
standard  partial  order  that  do  not  arise  directly  from  the  actual  computation. 

This  section  proceeds  by  removing  the  irrelevant  detail  of  the  network  activity,  and  then  building 
a  partial  order  time  model  that  will  be  standard  for  this  thesis. 


Abstracting  Away  Network  Activity  The  goal  of  partial  order  time  is  to  express  the  temporal 
ordering  perceived  by  the  processes  themselves.  The  first  step  toward  building  such  models  consists 
of  abstracting  away  details  imperceivable  by  the  processes:  the  state  and  transformations  of  their 
queues.  Thus  we  begin  by  defining  the  NET.  abstract  time  model  which  acts  on  ground-level 
computation  graphs. 

The  NET- ABSTRACT  model  abstracts  away  network  activity  as  follows.  In  a  ground-level  graph, 
the  photo  nodes  record  both  the  automata  state  and  the  queue  state  at  the  process.  For  each  photo 
node,  we  retain  only  the  recorded  automata  state.  We  delete  the  nodes  marking  arrive  transitions 
and  depart  transitions.  For  each  process,  the  nodes  in  the  net  .abstract  image  correspond  to  a 
subsequence  of  the  nodes  in  the  ground-level  graph;  we  draw  edges  connecting  the  nodes  in  this 
sequential  order. 
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Basically,  the  net,  abstract  image  of  a  graph  consists  of  a  copy  of  the  original  graph,  with  the 
photo  nodes  relabeled  and  the  irrelevant  transition  nodes  deleted.  The  representation  map  follows 
this  description.  Let  q  be  a  ground-level  grajA,  and  ^  =  net.abstract(q).  The  representation 
map  ( NET- abstract,  o)  takes  each  atom  in  $  to  its  original  image  in  a,  with  one  exception — 
deleted  transition  nodes.  Suppose  node  >1  in  a  is  a  transition  node  that  NET.  abstract  deletes. 
Let  E\  and  Ei  be  the  edges  incident  to  a.  Let  E  be  the  edge  in  l3  where  A  would  have  been.  Then 

( NET- ABSTRACT,  q)(£)  =  {Ei,A,E2} 


Figure  2.6  shows  how  the  NET- abstract  model  applies  to  the  sample  ground-level  computa¬ 
tion  graph  from  Figure  2.1.  Figure  2.7  clarifies  the  representation  map. 

Timelines  The  timelines  model  organizes  individual  process  activity  into  linear  timelines.  We 
obtain  the  TIMELINES  image  of  a  ground-level  computation  graph  a  in  several  steps: 

•  We  apply  NET- abstract  to  a.  Let  ,5  be  the  resulting  graph. 

•  At  each  process,  we  create  a  ±  node  for  the  first  photo  node  in  and  a  T  node  for  the  last 
photo  node  in 

•  We  copy  each  send  and  receive  node  in  /3. 

•  Removing  the  send  nodes,  receive  nodes,  and  extremal  photo  nodes  from  d  would  leave 
us  with  a  collection  of  maximal  connected  sequences  of  atoms,  each  occurring  at  only  one 
process.  For  each  such  sequence,  we  create  a  state  node  reflecting  the  process  state.  (This 
state  is  well-defined:  each  sequence  will  have  at  least  one  photo  node,  except  possibly  the 
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Rgure  2.6  The  net  .abstract  model  removes  irrelevant  network  detail.  This 
computation  graph  shows  the  result  of  applying  net  .abstract  to  the  graph  of 
Figure  2.1. 
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Rgura  2.7  Representation  under  the  net.  abstract  model  is  practically  the  iden¬ 
tity.  This  diagram  shows  how  atoms  in  the  process  p  part  of  the  net  .abstract 
graph  of  Rgure  2.6  represent  atoms  in  the  process  p  part  of  the  ground-level  graph 
from  Rgure  2.1 .  The  darker  gray  arrows  leading  to  the  deleted  depart  node  are  the 
only  significant  change. 


first  and  last  sequences  at  a  process.  These  extremal  sequences  will  pick  up  their  values  from 
±  and  T.) 

•  We  connect  consecutive  nodes  at  each  process  with  directed  edges. 


Figure  2.8  sketches  this  construction. 

Representation  follows  from  this  construction.  Suppose  q  is  a  ground-level  graph,  and  we 
apply  the  time  models  to  obtain  the  graphs: 


=  NET_ABSTOACT(a) 

7  =  TIMELINES(q) 

Each  node  A  in  7  replaces  a  sequence  5  of  atoms  in  Node  A  represents  in  q  the  union  of  what 
the  elements  of  S  represent. 


(timelines,  a  )( A)  =  (J  (  NET. ABSTRACT,  a)(B) 

B€S 


Figure  2.9  sketches  this  relation. 

For  process  p,  the  model  timelines  p  produces  only  the  timeline  belonging  to  process  p. 
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Rgura  2.8  The  timelines  model  produces  this  graph  when 
applied  to  the  ground-level  graph  of  Figure  2.1 . 
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Figure  2.9  Each  node  in  the  timelines  model  represents  a  sequence  of  atoms  in 
the  original  ground-level  graph.  This  diagram  shows  how  atoms  in  the  process  p 
part  of  the  timelines  graph  of  Figure  2.8  represents  atoms  in  the  process  p  part  of 
the  ground-level  graph  from  Figure  2.1 . 
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Th«  Partial  Order  The  parhal-ORDER-TIME  model  organizes  the  timelines  of  timelines 
into  a  system-wide  partial  order.  We  obtain  the  partial  .order .time  image  of  a  ground-level 
computation  graph  a  in  several  steps: 

•  We  apply  hmeunes  to  a. 

•  We  merge  the  ±  nodes  into  a  single  global  X  node. 

•  We  merge  the  T  nodes  into  a  single  global  T  node. 

•  For  each  received  message,  we  draw  a  directed  edge  from  its  send  node  to  its  receive  node. 

Representation  follows  directly  from  hmeunhs  representation:  the  1  and  T  nodes  represent  the 
union  of  what  the  merged  nodes  represent,  and  the  message  edges  represent  nothing.  Figure  2. 10 
sketches  this  construction. 

Since  a  trace  must  have  at  least  two  frames,  we  observe  that  the  minimal  partial  .  order  .  time 
graph  consists  of  X,  T,  and  a  state  node  for  each  process. 

This  construction  ensures  that  a  process  cannot  have  two  consecutive  “external"  nodes  (that  is, 
extremal  or  message  event  nodes). 

The  models  (partial -ORDER -TIME,  timelines)  form  a  Type  4  parallel  pair:  consistency,  in¬ 
dependence,  flow-support,  and  strong  monotonicity  are  all  easily  established.  Indeed,  we  could 
define  flow-support  in  terms  of  PARTIAL -Order -TIME:  a  graph  M(a)  is  flow-supported  iff 

A  — >B  in  M(a)  =>  A  — ►  B  in  partial- ORDER -TIME(q) 

The  construction  of  partial  -  order  -  TIME  naturally  suggests  how  to  obtain  the  factor!  ng  model 
PARTIAL -ORDER. TIME/TIMELINES  . 


initial  send  state  internal  state 

state  M  17  computation  23 


initial  internal  state  receive  state 

state  computation  32  M  12 

Rgure  2.10  The  partial. order. time  model  produces  this  graph 
when  applied  to  the  ground-level  gr^h  of  Figure  2.1 . 
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The  transitive  closure  TIMELINES  builds  a  total  order  on  the  nodes  at  each  process.  The  transitive 
closure  PARIIAL-ORDER.TIME  builds  a  partial  order  on  the  nodes  at  all  processes. 


2.3.  Timeslices  and  Global  States 


This  section  discusses  how  the  mechanics  of  distributed  time  extend  to  handle  the  problems  of  real 
and  apparent  simultaneity  in  asynchronous  distributed  systems.  Section  2.3. 1  defines  timeslices  in 
computation  graphs.  Section  2.3.2  discusses  global  states  in  computations.  Section  2.3.3  discusses 
the  relation  between  global  states  and  timeslices,  and  Section  2.3.4  discusses  the  finer  structure  of 
timeslices. 


2.3.1.  Timeslices 

We  construct  time  models  to  package  periods  of  activity  at  processes  into  events  or  states,  which 
appear  in  the  computation  griqph  as  nodes.  Two  nodes  that  a  computation  graph  leaves  unordered  are 
logically  concurrent,  in  that  the  graph  does  not  specify  one  happening  before  another.  A  maximal 
set  of  mutually  concurrent  nodes  represents  a  logical  slice  of  time  across  this  computation;  this 
meaning  follows  naturally  from  the  semantics  of  the  computation  graph:  any  other  node  must 
happen  either  before  or  after  some  node  in  the  set. 

We  define  a  timeslice^  to  be  a  maximal  mutually  concurrent  set  of  nodes.  That  is,  X  is  a 
timeslice  iff  X  satisfies  two  conditions: 


1.  X  is  mutually  concurrent:  no  A,  B  €  X  satisfy  A  — >  B,  and 

2.  X  is  maximal:  no  mutually  concurrent  Y  exists  properly  containing  X. 


This  definition  of  mutually  concurrent  automatically  prohibits  cyclic  events  from  timeslices. 

A  partial  timeslice  is  a  subset  of  a  timeslice — that  is,  a  set  of  mutually  concurrent  nodes  that  is 
not  necessarily  maximal.  (If  the  precedence  relation  from  a  computation  graph  were  guaranteed  to 
be  an  order,^  then  a  partial  timeslice  is  simply  an  antichain.) 


'Spezialetti  (Sp89]  uses  the  terai  “timeslice,”  and  Mattem  [Ma89I  uses  “time  slice”;  the  timeslices  there  are  special 
cases  of  the  timeslices  here. 

^Section  2.2.3  presented  a  formal  definition. 
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2.3.2.  Global  States 


The  Physical  Computation  In  the  physical  system,  computation  takes  place  in  the  space-time 
region  consisting  of  the  cross  product  of  the  set  of  processes  with  a  continuous  intet  val  of  real  time. 
A  physical  global  state  consists  of  the  state  of  the  entire  system  at  some  point  in  time — that  is,  a 
slice  of  the  space-time  region. 


Ground-Lavai  Grapha  Each  atom  of  a  ground-level  computation  graph  a  implicitly  represents 
some  subset  of  the  computation  space-time  region.  The  collection  of  subs.":ts  represented  by  all  the 
atoms  in  a  constitutes  a  partition  of  the  space-time  region.  For  the  space-time  slice  corresponding 
to  a  physical  global  state,  we  can  find  sets  of  atoms  from  the  ground-level  graph  a  that  represent  a 
subset  of  the  space-time  region  that  contains  this  slice.  (For  a  trivial  example,  consider  the  set  of 
all  the  atoms  in  the  graph.)  We  say  that  a  set  X  of  atoms  of  ground-level  a  is  a  global  state  when 
it  is  the  minimal  subset  representing  a  slice:  when  X  contains  the  slice  but  no  proper  subset  of  .V 
does. 


Abstract  Graphs  Suppose  time  model  M  is  defined  for  ground-level  graphs.  A  computation 
graph  0  produced  by  M  is  supposed  to  “forget”  which  ground-level  graph  generated  it.  The  graph 
0  is  also  supposed  to  express  the  objects  of  interest  as  nodes.  The  model  M  has  an  explicit 
representation  map  to  tell  us  what  these  nodes  represent  in  pre-images  of  /3.  To  talk  about  global 
states  in  0,  we  want  to  talk  about  three  aspects: 

•  a  set  of  nodes  in  0 

•  that  minimally  represents  a  ground-level  global  state 

•  in  some  ground-level  graph  that  M  transforms  to  0. 

Formally,  suppose  that  time  model  M  is  defined  on  ground-level  graphs.  A  graph  3  that  M 
produces  is  the  M  image  of  at  least  one  ground-level  graph  a.  A  set  X  of  nodes  in  0  minimally 
represents  a  global  state  when  some  ground-level  graph  a  exists  satisfying  the  conditions: 

•  M(a)  =  0-, 

•  X  represents  a  global  state  K  in  a: 

K  C  y(M,Q)(/i) 

A£X 

•  however,  no  proper  subset  of  X  represents  Y. 

Figure  2. 1 1  illustrates  how  node  sets  from  higher-level  graphs  correspond  to  global  states. 
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initial  send  state  internal  state 


time 


Figure  2.11  Global  states  arise  from  real  simultaneity.  Here,  the  region  Z  in  the 
space-time  diagram  at  the  bottom  indicates  the  activity  at  time  <  =  l.9.  The  atom  set 
Y  in  the  ground-level  graph  in  the  middle  is  the  minimal  set  mapping  to  this  instant, 
and  thus  is  a  global  state.  The  node  set  X  in  the  partial  .order  .time  graph  at  top 
minimally  represents  the  global  state  Y.  (The  set  .Y  also  is  a  timeslice.) 
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2.3.3.  The  Relation  Between  Timeslices  and  Global  States 


A  ground-level  graph  a  expresses  the  physical  computation.  Its  abstraction  under  the  time  model 
M  is  the  graph  0  =  M(a).  In  general,  time  models  will  not  be  injective:  many  ground-level  graphs 
may  map  to  0  under  M.  If  the  set  of  ground-level  graphs  describes  “possible”  computations,  the 
set 


Q  =  {M(a)  :  Q  is  a  ground-level  graph} 
describes  the  possible  computations  when  viewed  through  the  model  M. 


The  Timeslice  Condition  If  M  is  well-constructed,  then  the  timeslices  in  a  graph  that  M 
generates  represent  exactly  the  significant  global  states  in  the  physical  computations  from  which 
this  graph  abstracts.  Formally,  suppose  time  model  M  on  ground-level  graphs  generates  the  set 
Q.  We  develop  criteria  for  a  model  to  have  timeslices  with  the  appropriate  semantics.  Model  M 
satisfies  the  Timeslice  Condition  iff  for  each  0  satisfies  these  requirements: 

1.  For  each  set  X  of  nodes  in  0,  the  following  are  equivalent  statements: 

•  X  minimally  represents  a  global  state  K  in  some  ground-level  graph  a  withM(Q)  =  3. 

•  X  is  a  timeslice  in  '0. 

2.  Each  ground-level  graph  a  with  /3  =  M(a)  and  each  global  state  K  in  a  satisfy  the  statement: 

•  If(M,  Q)(i4)nK  0  for  some  node  A,  then  some  timeslice  in/?  minimally  represents 
Y. 

Some  partial  order  models  fail  to  meet  the  Timeslice  Condition.  For  example,  a  version  of 
PARTIAL -ORDER -TIME  that  omitted  the  state  nodes  would  fail:  if  a  process  p  executes  a  receive 
immediately  after  a  send,  then  global  states  corresponding  to  the  real  time  interval  between  those 
events  cannot  be  represented  by  timeslices.  Figure  2.12  sketches  an  example. 

The  view-completeness  property  from  Section  2.2.6  prevents  these  scenarios  where  timeslices 
cannot  extend  to  all  processes. 

Theorem  2.1  Suppose  (M,  M')  is  a  Type  1  parallel  pair.  Then  all  timeslices  in  M 
touch  every  process. 


Proof  Suppose  timeslice  X  does  not  touch  process  p.  Let  A  be  the  maximal  node  at  p  that 
precedes  or  equals  some  node  in  X.  Let  B  be  the  minimal  node  at  p  that  follows  some  node 
in  X.  We  must  have  A  — >  B,  for  if  B  A  then  X  could  not  be  a  timeslice.  All  nodes  and 
edges  between  A  and  B  must  be  mutually  concurrent  with  each  node  in  X.  Further,  the  first 
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Rgure  2.12  Consider  the  partial  order  produced  by  the  transitive  closure  of  this 
event-only  graph.  Edge  E  at  process  p  is  concurrent  with  both  Ri  and  Si  at  process 
q.  However,  ail  nodes  at  process  p  either  precede  Ri  or  follow  Si.  Consequently, 
this  graph  is  not  view-complete.  As  a  result,  no  timesiice  can  minimally  represent 
a  global  state  containing  Ri  or  Si,  since  any  corresponding  process  p  node  will  not 
be  concurrent. 

and  last  edges  must  be  acyclic  (otherwise  A  would  advance  and/or  B  would  move  back).  View- 
completeness  gives  the  existence  of  nodes  at  p  with  the  same  properties,  thus  X  could  not  have 
been  a  timesiice.  □ 

By  including  state  nodes,  the  partial. ORDER-TIME  model  of  Section  2.2  is  easily  view- 
complete  and  thus  consistent.  The  construction  of  partial  .order  .time  provides  some  additional 
properties: 

•  Precedence  of  two  nodes  in  PARTIAL .  ORDER. TIME  implies  real-time  precedence  of  the  ac¬ 
tivity  those  nodes  represent  in  any  underlying  computation. 

•  Each  node  in  partial.  ORDER  .time  represents  a  connected  region  of  activity  at  a  process. 

•  The  activity  of  each  process  any  point  in  time  is  represented  by  some  node  in 
PARTIAL -ORDER-TIME. 

These  properties  serve  to  establish  the  following  result: 

Theorem  2.2  The  partial -Order .time  model  satisfies  the  Timesiice  Condition. 

Proof  Let  /3  be  a  graph  generated  by  partial -ORDER .time. 

Suppose  node  set  X  is  not  a  timesiice.  Then  either  X  does  not  touch  every  process  (in  which 
case  it  cannot  represent  a  global  state),  or  X  is  not  mutually  concurrent  (in  which  case  one  node  of 
X  must  precede  another  in  and  thus  in  real  time  in  all  traces). 
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Suppose  node  set  X  is  a  timesHce.  We  construct  a  trace  where  the  activities  X  represents  are 
simultaneous.  Assign  integers  to  the  nodes  of  0  by  hrst  setting  each  node  of  X  to  0,  then  setting 
each  node  A  following  to  be  one  greater  than  the  maximum  value  of  its  predecessors  and  each 
A  preceding  A  to  be  one  less  than  the  maximum  value  of  its  successors.  If  ~j  is  the  value  on  1. 
add  j  to  each  value.  A  trace  exists  that  schedules  each  instantaneous  node  (1,  T,  and  transition 
nodes)  labelled  i  at  t  =  i,  and  each  state  node  labelled  i  in  the  open  interval  {i  —  1 ,  i  +  1 ).  Then 
timeslice  A  describes  the  state  of  the  system  att  =  j. 

In  any  computation  generating  0,  a  physical  global  state  generates  a  node  set  A  in  /?  touching 
every  process.  □ 


2.3.4.  The  Structure  of  Timeslices 

By  definition,  a  timeslice  is  maximal  set  of  mutually  concurrent  nodes.  What  do  these  timeslices 
look  like?  If  acyclic,  the  singletons  {T}  and  {±}  are  trivially  timeslices:  no  concurrent  nodes 
exist.  What  about  the  other  timeslices? 

Naively,  a  timeslice  should  consist  of  one  node  per  process.  In  general  models,  nodes  may 
represent  activity  at  multiple  processes.  Hence  in  general,  the  informal  “one-per-process”  tuple 
has  two  formal  characterizations: 

•  as  a  vector — ^an  array  of  nodes,  with  the  constraint  that  the  process  p  entry  occurs  at  p\  and 

•  as  a  cut — a  set  of  nodes  that  contains,  for  each  process  p,  exactly  one  node  occurring  at  p. 

A  cut  is  the  node  set  of  a  unique  vector,  but  the  node  set  of  an  arbitrary  vector  is  not  necessarily 
a  cut.  In  either  case,  we  can  use  projection  to  isolate  particular  entries — e.g.,  TTp  A  is  the  process  p 
entry  of  A. 

The  literature  uses  consistent  cut  for  a  cut  that  is  also  a  timeslice  in  the  global  model.  If  the  node 
set  of  a  vector  is  a  timeslice,  then  it  is  also  a  consistent  cut  (because  in  parallel  pairs,  the  local  process 
models  are  total  orders,  so  distinct  nodes  at  the  same  process  cannot  be  concurrent).  However,  not 
all  timeslices  will  be  consistent  cuts — Figure  2.12  shows  a  counter-example.  View-completeness 
eliminates  this  problem  for  our  partial  order  models,  as  Theorem  2. 1  showed.  View-completeness 
also  provides  a  convenient  extension  property  for  partial  timeslices. 

Corollary  2.3  Let  (M,  M')  be  aTypc  1  parallel  pair.  Any  set  of  mutually  concurrent 
nodes  from  M  extends  to  a  full  consistent  cut. 

The  timeslices  in  a  partial  order  graph  will  be  the  extrema  singletons  (which  are  easily  consistent 
cuts,  since  the  extrema  represent  every  process),  and  the  sets  consisting  of  a  non-extremal  node 
from  each  process. 
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Timestamp  Vectors  Let  (M,  M')  be  a  parallel  pair.  We  define  the  timestamp  vector  for  a  node 
A  to  be  the  vector  V(  A)  consisting  of  the  maximal  node  at  each  process  that  precedes  or  equals  A 
in  the  global  model.  That  is,  let  B  be  the  process  p  entry  Vj,  V(A).  Then  B  =l  A  in  M,  and  each 
node  C  at  each  process  p  satisfies 

C  A  in  M  =»  C'  ■=*  B'  in  M' 

(where  C'  is  the  maximal  x-p  node  that  C  represents  there,  and  B'  is  the  minimal). 

When  the  global  model  of  a  parallel  pair  is  transitively  bounded,  all  entries  of  all  timestamp 
vectors  are  defined.  When  the  parallel  pair  is  Type  2  as  well,  the  definition  becomes  much  simpler, 
since  each  non-extremal  node  in  M  corresponds  to  a  unique  node  in  M'. 

View-completeness  endows  timestamp  vectors  with  another  useful  property: 

Theoram  2.4  Suppose  parallel  pair  (M,  M^)  is  Type  1 .  Let  Ai , ...,  A^  be  mutually 
concurrent  nodes  in  a  graph  from  M;  let  p  be  a  process  at  which  no  A^  occurs,  and  let 
node  B  be  the  p-maximal  node  among  the  p  entries  of  the  vectors  V(A,). 

Then  there  exists  a  minimal  acyclic  node  C  following  B  dip,  and  C  is  concurrent  vith 
each  Ai. 

Proof  This  result  follows  directly  from  the  proof  of  Theorem  2.1.  □ 

Suppose  A  is  a  node  in  a  partial  .ORDER -TIME  graph  that  does  not  occur  at  process  p.  One 
implication  of  Theorem  2.4  is  that  the  node  following  the  p  entry  of  V(A)  is  mutually  concurrent 
with  A. 

Rollback  Vectors  We  can  define  rollback  vectors  as  the  dual  to  timestamp  vectors.  The  rollback 
vector  for  a  node  A  is  the  vector  R(  A)  consisting  of  the  minimal  node  at  each  process  that  follows 
or  equals  A.  That  is,  let  B  be  the  process  p  entry  x-p  R(  A).  Then  A  B  in  M,  and  each  node  C 
at  each  process  p  satisfies  the  statement; 

A  =1  C  in  M  C'  in  x-pM^ 

(Again,  let  B'  be  the  maximal  node  that  B  represents  in  the  p  timeline,  and  let  C  be  the  minimal 
that  C  that  C  represents.) 

Just  as  timestamp  vectors  describe  the  maximal  history  cone  of  an  node,  rollback  vectors 
describe  the  minimal  future  cone.  The  term  “rollback  vector”  originates  in  this  fact:  if  A  were  to 
be  instantaneously  undone,  R(A)  describes  the  frontier  of  the  region  to  be  rolled  back.  The  dual 
of  Theorem  2.4  holds  for  rollback  vectors. 


Precedence  of  Vectors  The  linear  order  on  individual  timelines  induces  a  natural  relation  on 
vectors:  we  say  that  V  when  Xp  V  Xp  W  for  each  process  p,  but  V  ^  W. 
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The  Timeslice  Lattice  Vectors  of  nodes,  one  per  process,  induce  some  natural  entry-wise 
operations— one  of  which  we  have  already  done.  For  vectors  X  and  Y,  define  their  meet  X  n  Y 
to  be  the  vector  obtained  by  taking,  for  each  process  p,  the  minimal  p  entry  from  X  and  Y.  Define 
the  join  X  UY  symmetrically  by  taking  the  entry-wise  maximum. 

We  know  that  timeslices  from  IVpc  1  parallel  pairs  are  consistent  cuts,  and  thus  have  a  vector 
structure.  In  Type  2  parallel  pairs,  the  meet  and  join  operations  preserve  this  property. 


Theorem  2.5  Suppose  X  and  Y  are  consistent  cuts  in  a  Type  2  parallel  pair.  Then 
X  r\Y  and  X  U  V  are  both  consistent  cuts. 


Proof  Suppose  Z  =  X  □  V  is  not  a  timeslice.  Then  Z  must  equal  neither  X  nor  Y.  There  must 
exist  processes  p  and  q  such  that  X  contributes  the  process  p  entry  of  Z  and  Y  contributes  the  q 
entry,  but  these  entries  are  not  mutually  concurrent,  h&t  A,  B  he  the  p  entries  of  X,  K  (respectively ), 
and  C,  £)  be  the  9  entries.  By  hypothesis,  A  — *■  B  but  D  — ►  C.  If  A  and  D  are  not  concurrent, 
then  either  A  — ►  C  (so  X  is  not  a  timeslice)  or  D  — y  B  (so  Y  is  not  a  timeslice). 

The  case  for  join  is  symmetric.  □ 

Entry-wise  precedence  -<  partially  orders  consistent  cuts;  in  this  order,  X  U  K  is  the  least  con¬ 
sistent  cut  dominating  consistent  cuts  X  and  Y  and  X  n  K  is  the  greatest  consistent  cut  dominated 
by  X  and  Y.  These  observations,  along  with  Theorem  2.5,  suffice  to  establish  that  timeslices  form 
a  lattice:  a  nonempty,  partially-ordered  set,  such  that  each  pair  of  elements  has  a  least  upper  bound 
and  greatest  lower  bound  in  the  set  [DaPr90]. 


Theorem  2.6  Timeslices  in  Type  2  parallel  pairs  form  a  lattice. 


Adjusted  Vectors  An  easy  variation  of  Theorem  2.6  is  that  the  set  of  timeslices  containing 
some  specified  node  also  forms  a  lattice  (since  meet  and  join  will  preserve  this  membership).  The 
bounds  on  this  lattice  derive  directly  from  timestamp  and  rollback  vectors. 

Let  i4  be  a  non-extremal  node  at  process  p  in  a  graph  from  a  Type  1  parallel  pair.  Theorem  2.4 
tells  us  that  for  each  9  ^  p,  a  minimal  acyclic  node  exists  in  the  q  timeline  following  the  q  entry  of 
\{A).  Dehne  the  adjusted  timestamp  vector  V*(A)  by  replacing  each  non-p  entry  in  V(>4)  by  this 
“successor.”  Similarly  define  the  adjusted  rollback  vector  R''(A)  by  replacing  each  non-p  entry 
with  its  “predecessor”:  the  maximal  acyclic  node  preceding  the  R(/4)  entry. 

For  acyclic  Type  2  models,  this  construction  is  stated  more  simply:  if  A  occurs  at  p,  obtain 
V*(  A)  by  replacing  each  each  non-p  entry  of  V(A)  by  its  immediate  successor,  and  obtain  R*(.4) 
by  replacing  each  non-p  entry  of  R(A)  by  its  immediate  successor. 
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Theorem  2.7  Let  (M,  M')  be  a  Type  1  parallel  pair.  Let  i4  be  an  acyclic  node  from 
an  M  graph.  Let  {X|,  ...,Xfc}  be  the  set  of  all  timeslices  containing  A.  Then 

V(^)  =  X,  n  ^2  n ...  n  X* 

R*(>1)  =  X|UX2U...UXt 

Proof  From  Theorem  2.4,  A  is  either  concurrent  with  or  equals  each  element  in  its  adjusted 
vector.  From  Corollary  2.3,  timeslices  exist  containing  A  and  each  of  these  elements.  Thus  the 
bound  can  be  achieved.  By  definition,  no  element  from  V(>1)  or  R(i4)  except  A  can  be  in  a 
timeslice  with  A.  Further,  no  cyclic  node  can  be  in  a  timeslice  with  A.  Thus,  these  bounds  are 
tight.  □ 


2.4.  Clocks  for  Distributed  Time 


Section  2.4.1  sketches  some  clock  primitives  for  time  models.  Section  2.4.2  sketches  some  clock 
primitives  for  parallel  pairs.  Section  2.4.3  considers  issues  of  when  clocks  have  sufficient  infor¬ 
mation  to  answer  these  queries.  Section  2.4.4  discusses  how  timestamp  vectors  form  a  basis  for  an 
implementation  of  these  primitives. 

An  Implicit  Parameter  The  behavior  of  clock  primitives  will  all  be  specified  in  terms  of 
the  ground-level  computation  graph  current  at  the  time  of  execution.  We  denote  this  graph  by 
CUR.CRAPH.  We  do  not  include  this  graph  as  an  explicit  parameter  since  the  processes  that  will 
invoke  these  primitives  will  not  have  explicit  access  to  this  graph. 


2.4.1 .  Primitives  for  Time  Models 

Suppose  time  model  M  acts  on  ground-level  computation  graphs.  We  define  the  most  fundamental 
clock  primitive: 

•  PRECEDES{A,  B,  M)  returns  true  iff  A  and  B  are  nodes  in  the  graph  M{CUR. GRAPH)  and 
an  edge  in  this  graph  connects  A  to  B. 

PRECEDES  allows  us  to  implement  two  other  primitives: 

•  CONCURRENT{A,  B,  M)  returns  true  iff  A  and  B  are  nodes  in  graph  M.{CUR. GRAPH),  and 
in  this  graph,  A  and  B  are  concurrent. 

CONCURRENT{A,B,M)  = 

^PRECEDES{A,B,M)  A  ^PRECEDES{B,  A,M) 
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•  ACYCUC{A,  M)  returns  true  iff  node  A  is  acyclic  in  M{CUR. GRAPH). 

ACYCUC{A,M)  =  -'PRECEDES{A,  A,M) 

This  specification  raises  some  questions.  Are  the  primitives  well-defined?  How  do  processes 
provide  these  parameters?  The  remainder  of  this  section  considers  these  issues. 


WellHlefined  Answers  Suppose  system  computation  extends  the  current  ground-level  com¬ 
putation  graph  from  a  to  a'.  If  A  — *  B  in  a,  will  A  — ►  B  in  a'?  If  A  -/-+  B  in  a,  will  A  B 
in  q'?  The  monotonicity  definitions  in  Section  2.2.5  provide  some  answers: 

•  If  the  time  model  M  is  strongly  monotonic,  then  the  PRECEDES  primitive  is  well-defined. 

•  If  the  time  model  M  is  only  weakly  monotonic,  then  the  PRECEDES  primitive  still  behaves 
reasonably,  with  the  exception  of  occasionally  changing  from  false  to  true  as  computation 
progresses. 

We  make  the  implicit  assumption  that  the  models  we  define  primitives  for  are  strongly  monotonic. 
However,  we  note  that  the  weakly  monotonic  case  can  also  be  made  to  work  once  we  handle  the 
problem  of  convergence:  knowing  when  a  precedence  answer  become  stable. 


Node  Names  Processes  using  these  primitives  must  specify  nodes  as  parameters.  Specifying 
these  primitives  begged  the  question  of  how  processes  themselves  should  refer  to  nodes.  We 
assume  that  nodes  in  a  computation  have  unique  names.  Whether  names  should  be  mere  identifiers 
(e.g.,  “node  73  at  process  12”)  or  mote  complete  descriptions  (e.g.,  “node  73  at  process  12:  state 
change  from  to  qy”)  is  another  issue.  This  naming  convention  carries  an  implicit  assumption: 
from  the  information  in  a  node  name,  one  may  extract  the  process  at  which  the  node  occurred. 


Shifting  Modeis  We  use  these  simple  primitives  to  ask  about  precedence  in  a  model  M. 
However,  a  natural  extension  is  to  ask  about  other  types  of  precedence  using  other  models.  For 
example,  in  a  parallel  pair  (M,  M^),  we  can  ask  about  individual  steps  at  processes  using  M',  or 
about  precedence  at  process  p  using  TpM'.  The  format  of  PRECEDES  and  CONCURRENT  already 
grants  this  ability;  we  us ;  the  model  parameter  to  specify  the  appropriate  model.  However,  shifting 
nodes  between  levels  in  a  parallel  pair  can  be  tricky,  because  a  node  from  a  parallel  pair  exists  on 
three  levels: 

•  as  a  node  in  the  global  graph; 

•  as  the  set  of  nodes  it  corresponds  to  in  the  disjoint  union  of  the  local  graphs; 

•  as  the  set  of  nodes,  if  any,  it  corresponds  to  in  each  process  graph. 
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In  general,  shifting  levels  requires  some  care  to  avoid  ambiguity.  For  Type  2  parallel  pairs  such 
as  (PARnAL-ORDER.TlME,Tl%4ELlNES),  this  multiplicity  is  simple:  each  node  in  the  global  partial 
order  represents  exactly  one  node  at  one  process,  except  for  L  and  T. 


2.4.2.  Primitives  for  Pairs 

We  define  some  additional  primitives  for  parallel  pairs  and  nonlinear  pairs.  We  assume  that  our  pair 
is  Type  3:  both  strongly  monotonic  and  lype  2.  (Strong  monotonicity  assures  us  that  precedence 
relations  are  well-defined;  IVpe  2  provides  convenient  node  structure.) 


A  Primitive  for  “Now”  First,  processes  need  access  the  name  of  their  current  node.  We  specify 
a  primitive: 

•  CUR.NODE{p,{M,M'))  returns  the  name  of  the  current  process  p  node  in  the  graph 
M^CUR.GRAPH). 

We  allow  only  process  p  to  ask  CUR.NODE{p,  (M,  M')). 

Vector  Operations  Processes  need  to  perform  vector  operations  in  nonlinear  pairs.  We  specify 
two  primitives: 

•  MAX{V,  W,  (M,  M'))  is  defined  for  vectors  V  and  W  of  nodes  from  M{CUR. GRAPH),  and 
returns  the  entry-wise  maximum  (using  M'  to  sort  entries). 

•  COMMRE{V,  W,  (M,  M'))  is  defined  for  vectors  V  and  W  of  nodes  from  M{CUR.GRAPH), 
and  is  true  xffV  -^W  (using  M'  to  sort  entries). 


Meta-Primitive  We  want  to  define  enumerative  primitives  for  our  clock  suite.  We  begin  by 
defining  two  “meta-primitives”  as  building  blocks.  Let  /I  be  a  variable  representing  an  unspecified 
node,  and  $  be  a  predicate  on  A.  We  specify  two  meta-primitives: 

•  UST{A,^{A),l3)  returns  the  set  of  nodes  in  graph  0  that,  when  substituted  for  A,  satisfy 

•  NODE{A,^{A),^)  returns  the  single  node  A  from  ^  satisfying  4>(A)  (and  is  undefined 
otherwise). 

These  meta-primitives  themselves  are  off-limits  for  processes. 
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Primitives  that  Enumerate  We  use  list  and  node  to  build  clock  primitives  that  enumerate 
nodes,  rather  than  merely  providing  Boolean  answers.  These  primitives  apply  to  Type  3  parallel 
pairs  only.  (Recall  that  in  a  Type  2  parallel  pair  (M,  M'),  we  identify  nodes  in  M  with  nodes  in 
M'.)  We  specify  three  primitives: 

•  NEXTXp,  i4,  (M,  M'))  returns  the  M  node  that  follows  node  A  in  the  process  p  timeline. 

NE)nXp,A,iM,M'))  = 

NODE{B,  PRECEDES{A,  B,  ir^  M'),  MV  {CUR. GRAPH)) 

•  PREVIOUS{p,  A,  (M,  M'))  returns  the  M  event  that  precedes  node  A  in  the  process  p  time¬ 
line. 

PREVIOUS{p,A,{M,M'))  = 

NODE{B,  PRECEDES{B,  A,  Vj,  M'),  M' {CUR. GRAPH)) 

•  UST.CONCURRENT{p,  A,  (M,  M'))  returns  the  acyclic  M  nodes  at  process  p  that  are  con¬ 
current  with  event  A. 


UST.CONCURRENT{p,A,{M,M'))  s 
UST{B,  CONCURRENT{A,  B,  M)  A  ACYCUC{B,  M),  M'{CUR. GRAPH)) 

2.4.3.  Knowable  Pursuits 

Section  2.4. 1  considered  when  the  temporal  relation  that  a  clock  primitive  examines  is  weii-dehned. 
However,  we  have  not  examined  when  when  a  process  executing  of  a  clock  primitive  in  an  unfolding 
computation  will  have  sufficient  information  to  obtain  this  well-defined  answer.  For  example: 

•  When  should  the  clock  at  process  p  be  expected  to  handle  queries  about  a  node  4? 

•  What  precedence  relations  should  the  clock  at  p  be  expected  to  know  about? 

To  answer  these  questions,  we  informally  consider  an  “Elephant-Pig  Paradigm”:  processes 
never  forget  anything,  and  always  piggyback  each  link  in  a  precedence  path  with  complete  knowl¬ 
edge.  Although  this  paradigm  would  not  be  met  by  real  implementations,  it  serves  to  a  starting 
point.  Suppose  we  use  parallel  pair  (M,  M')  to  describe  computation,  and  node  C  occurs  at  process 
p.  We  specify  some  clock  guidelines: 

•  If  A  — ►  C  in  M,  then  process  p  at  node  C  may  ask  about  A. 

•  If  A  and  B  both  precede  C  in  M,  then  process  p  at  node  C  may  ask  about  the  relation 
between  A  and  C. 
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For  our  models,  flow-support  reasonably  approximates  the  Elephant-Pig  Paradigm.  If  we 
restrict  the  knowsd>ility  questions  to  a  parallel  pair  (M,  M^)  is  also  flow-supported  (e.g.,  (M,  M') 
is  a  IVpe  4  parallel  pair),  this  sketch  provides  some  answers: 

•  Process  q  at  node  C  €  M{CVR.  GRAPH)  will  get  an  answer  from  the  queries 

PRECEDES{A,  B,M') 

PRECEDES{Ay  B,M) 

iff  B  =♦  C  in  M{CUR.GRAPH). 

•  Process  q  at  node  C  €  M^{CIJR. GRAPH)  will  get  an  answer  from 

iV£X7T(p,  A,  (M,  M')) 

iff  this  node  exists,  and  precedes  or  equals  C  in  IA{CUR. GRAPH). 

•  Process  q  at  node  C  6  M{CUR. GRAPH)  will  get  an  answer  from 

PREVIOUSip,  B,  (M,  M')) 
iff  B  =i  C  in  M{CVR.GRAPH). 

•  Realistically,  it  seems  unreasonable  for  a  process  to  know  everything  in  its  past.  Consequently, 
we  restrict  UST.  CONCURRENT  to  examine  local  nodes  only.  Only  process  p  can  query 
UST.CONCURRENT{p,  A,  (M,  M'))i  only  for  A  preceding  the  query  node. 


2.4.4.  An  Implementation 

Vector  clocks  provide  a  natural  approach  for  tracking  temporal  precedence  in  parallel  pairs. 
Historically,  research  in  partial  order  time  includes  vector-based  clock  implementations  [StYe8S, 
Fi88,  Fi9I,  Ma87,  KeKo89,  Ma89].  Indeed,  the  term  “vector  time”  has  surfaced  for  partial  order 
time,  although  we  feel  this  is  a  misnomer  as  it  confuses  an  implementation  with  the  underlying 
structure.  (However,  these  particular  implementations  do  permit  extra  elegance  in  some  applica¬ 
tions.) 

The  vector  relation  on  timestamp  vectors  follows  the  temporal  relation  on  the  events. 


Theorem  7.8  Any  two  nodes  A  and  B  from  a  Type  1  parallel  pair,  satisfy  the 
statement: 


V(A)-!V(B)  ^  {A-^BAB-hA) 
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Proof  Each  entry  of  a  timestamp  vector  for  a  given  event  precedes  or  equals  that  event.  If 
A  — *  B  then,  then  the  p  entry  of  V(A)  precedes  or  equals  B,  and  thus  precedes  or  equals  the 
maximal  node  at  p  preceding  B.  Conversely,  suppose  V(>1)  x  V(fi)  and  A  occurs  at  process  q. 
Then  A  precedes  w  equals  the  q  entry  of  V(i4),  which  precedes  or  equals  the  q  entry  of  V(5), 
which  precedes  or  equals  B.  □ 

(In  fact,  this  theorem  holds  for  parallel  pairs  more  general  than  Type  I .  Only  transitive  bounding 
is  required.) 

Strong  monotonicity  implies  that  the  timestamp  vector  for  an  event  can  actually  be  defined  at 
some  point  in  the  computation.  Flow-completeness  implies  that  the  timestamp  vector  for  an  event 
can  actually  be  defined  when  the  event  occurs.  Consequently,  to  implement  clocks  for  Type  4 
parallel  pairs  using  timestamp  vectors,  we  just  have  each  process  maintain  a  local  counter  and  a 
“current”  timestamp  vector.  When  a  process  sends  a  message,  it  piggybacks  the  timestamp  vector 
of  the  send;  when  a  process  receives  a  message,  it  updates  its  current  vector  to  be  the  entry-wise 
maximum.  Timestamp  vectors  allow  direct  implen^ntation  of  the  PRECEDES  and  CONCURRENT 
primitives,  and,  along  with  some  facility  for  remembering  history  and  event  descriptions,  allows 
implementation  of  the  remainder  of  the  primitives. 

Timestamp  vectors  also  function  as  clocks  for  more  general  types  of  parallel  pairs,  such  as  those 
lacking  flow-support,  and  those  whose  process  timelines  are  themselves  partial  orders.  The  imple¬ 
mentation  becomes  somewhat  more  complicated  in  these  scenarios,  however.  For  example,  non- 
flow-suppoited  models  suffer  from  an  information  gap:  when  event  A  occurs  at  process  p,  process 
p  may  not  have  sufficient  information  to  sort  A.  The  answer  to  the  query  PRECEDESiMz,  A,B,q) 
depends  on  when  the  query  is  made — and  we  need  a  time  model  M|  that  refines  to  M2  to  capture 
this  parameter.  (This  scenario  is  an  example  of  parameterized  clocks.)  Alternatively,  when  a 
process  timeline  is  itself  a  partial  order,  we  need  to  distribute  information  so  that  other  processes 
can  perform  the  vector  clock  algorithm — sorting  two  events  at  process  p  is  no  longer  a  matter  of 
comparing  two  scalars.  (Chapter  4  and  Chapter  S  discuss  these  issues  in  more  detail.) 

In  principle,  rollback  vectors  also  function  as  clocks  (the  dual  to  Theorem  2.8  holds),  but 
information  gaps  makes  implementation  impractical. 


2.5.  Example  Applications 

2.5.1.  Orphan  Detection 


An  immediate  application  of  distributed  time  is  accurate  orphan  detection.  When  an  event  is 
aborted,  any  event  that  could  have  been  influenced  by  the  aborted  event  is  an  orphan  and  should 
be  undone. 
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Tracking  this  dependence  in  an  asynchronous  distributed  system  is  difficult.  For  example, 
using  real  time  to  label  as  an  orphan  any  event  with  a  timestamp  greater  than  the  aborted  event 
will  generate  false  positives,  and  not  extend  to  work  in  environments  lacking  synchronized  real 
time  clocks.  Using  a  total  order  consistent  with  the  underlying  computation  also  generates  false 
positives — and  fails  to  extend  to  scenarios  such  as  rollback  recovery,  in  which  the  final  (replayed) 
instance  of  an  event  may  actually  occur  later  than  an  event  it  influenced. 

The  tools  of  distributed  time  solve  these  problems  by  allowing  us  to  talk  about  time  as  a  partial 
order,  and  by  allowing  us  to  move  transparently  from  the  partial  order  representing  the  physical 
computation  to  a  more  abstract  partial  order  that  represents  a  virtual  computation. 


2.5.2.  Immediate  Ordered  Service 


The  problem  of  immediate  ordered  service  consists  of  servers  processing  requests  from  clients  in 
an  asynchronous  distributed  system.  Each  server  has  a  list  of  outstanding  requests.  How  can  the 
server  choose  the  “earliest”  entry  to  process  without  necessitating  additional  communication  and 
discussion? 

This  problem  can  be  solved  by  applying  a  partial  order  time  model  to  the  computation,  and  hav¬ 
ing  servers  use  partial  order  clocks  to  sort  the  incoming  requests.  The  immediate  response  time  of 
vector  clocks  makes  that  implementation  particularly  attractive — especially  in  a  distributed,  asyn¬ 
chronous,  and  frequently  disconnected  environment.  (Indeed,  the  published  solution  [KeKo89i  to 
this  problem  is  one  of  the  independent  discoveries  of  the  vector  clock  mechanism.) 


2.6.  Comparison  to  our  Earlier  Publication 

We  presented  much  of  this  material  in  an  earlier  publication  [Sm93].  That  version  was  usually  more 
detailed,  but  was  also  more  preliminary.  This  section  briefly  discusses  some  of  the  differences. 

When  defining  process  automata,  the  earlier  publication  allowed  processes  to  know  their  input 
queue  was  nonempty,  but  proceed  without  receiving  any  message.  That  approach  unintentialiy 
permitted  anonymous  influence;  process  p  may  act  on  the  knowledge  that  a  message  has  arrived 
from  process  q,  but  our  time  models  would  not  establish  a  precedence  path  from  q  to  p.  The  revised 
message  rule  in  this  thesis  prohibits  this  scenario. 

When  developing  time  models,  the  earlier  publication  primarily  took  the  event-based  approach. 
This  approach  created  problems  with  view-completeness  and  complicated  discussion  of  certain 
application  problems.  This  thesis  avoids  these  problems  by  including  both  events  and  states  as 
nodes  in  computation  graphs.  In  this  respect,  the  construction  of  partial -ORDER  .time  in  this 
thesis  differs  from  the  POT  model  of  the  earlier  publication.  (In  particular,  partial.  ORDER  .time 
ensures  that  a  state  node  separates  any  two  event  nodes.) 


40 


When  defining  properties  of  time  models,  the  earlier  publication  did  not  formally  examine  issues 
of  information  flow.  The  definitions  of  flow-supported,  flow-v'rtual,  and  monotonicity  appear  for 
the  first  time  in  this  thesis. 

Chapters  7  and  8  of  the  earlier  publication  gives  a  much  fuller  discussion  of  parallel  pairs  and 
factoring  mod'’'s  than  Section  2.2.6  of  this  thesis  presents.  However,  the  earlier  publication  did  not 
explore  nonlinear  pairs,  and  took  a  different  approach  to  examining  the  taxonomy  of  pairs  The 
definitions  of  Type  1  through  Type  4  appear  appear  for  the  first  time  in  this  thesis. 

The  earlier  publication  provides  a  more  detailed  derivation  of  the  timeslice  results.  Theorem  2.2 
in  this  thesis  reflects  Theorems  13.6  through  13.8  of  the  earlier  report. 

Theorem  2.6  in  this  thesis  considers  only  timeslices  from  Type  2  parallel  pairs,  although  we  can 
show  that  consistent  cuts  in  general  parallel  pairs  form  lattices,  as  do  timeslices  from  any  transitive 
graph.  The  general  case  is  difficult  due  to  two  facts; 

•  Some  consistent  cuts  may  contain  nodes  that  touch  more  than  one  process,  but  not  all  of 
them. 

•  Some  timeslices  may  not  be  consistent  cuts. 

In  particular,  the  definitions  here  of  the  n  and  U  operations  and  the  x  relation  work  correctly  for 
the  well-behaved  vector-like  cuts  in  independent  consistent  parallel  pairs.  More  general  models 
require  more  careful  definitions.  Chapter  9  of  the  earlier  publication  provides  the  full  details. 
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Chapter  3 

Distributed  Snapshots 


3.1.  Overview 


The  distributed  snapshot  problem  provides  a  straightforward  application  of  the  distributed  time 
framework.  In  an  asynchronous  distributed  system,  what  one  process  perceives  about  the  rest  of 
system  is  always  out-of-date.  This  limitation  complicates  the  problem  of  capturing  a  snapshot:  a 
mosaic  depicting  the  global  state  of  the  system  at  some  instant. 

In  real  life,  we  think  of  time  as  a  linear  sequence  of  moments.  Consequently,  we  find  it  only 
natural  to  think  of  computations  as  linear  sequences  of  global  states.  Barring  anything  unusual, 
this  linear  model  actually  describes  the  behavior  of  the  system.  Unfortunately,  the  asynchrony  and 
the  distribution  in  the  system  make  it  difficult  for  processes  within  a  system  to  obtain  global  states. 
For  example,  suppose  process  p  at  time  t  wants  to  take  a  snapshot  of  of  the  global  state  of  the 
system  at  time  t,  or  even  at  some  unspecified  time  close  to  t.  Although  process  p  needs  knowledge 
of  the  other  processes  in  order  to  put  together  this  picture,  any  knowledge  it  may  obtain  will  be 
stale,  because  information  travels  at  a  finite  speed.  Further,  the  unpredictable  message  delays  mean 
process  p  cannot  even  know  how  stale  this  knowledge  is. 


The  Traditional  Solution  In  their  foundational  paper  on  snapshots,  Chandy  and  Lamport 
[ChLa85]  present  an  elegant  marker-pushing  protocol  that  works  despite  this  limitation.'  A 
process  initiating  the  protocol  receives  an  approximately  current  snapshot  with  a  counter-intuitive 
correctness  property:  while  this  snapshot  may  not  necessarily  describe  the  state  of  the  system  at 
any  single  instant,  it  describes  a  consistent  state  of  the  system. 

That  is,  suppose  process  p  initiates  a  snapshot  protocol  at  time  to,  and  at  time  ^|  receives  a 
snapshot:  a  tuple  X  describing  the  local  state  at  each  process.  There  exists  a  well-defined  history 
function  H  taking  each  t  in  the  interval  [<o, <i]  to  its  global  state  H{t):  the  tuple  consi.sting  of 
the  local  process  states  at  t.  Intuition  suggests  that  the  snapshot  X  ought  to  be  the  value  of  H  at 
some  instant  in  this  interval,  .\synchrony  causes  this  intuition  to  fail.  Lacking  perfect  knowledge, 
process  p  cannot  obtain  the  H  values;  lacking  real  time  clocks,  process  p  cannot  even  obtain  the 


'Their  system  model — lossless  FIFO  message  channels — somewhat  constrains  the  asynchrony. 


t  values.  A  process’s  only  sense  of  time  derives  from  the  messages  the  process  receives  and  the 
actions  it  takes. 


Here  lies  the  rub:  many  valid  histories  H'  exist  with  =  fi'iio),  H(ti)  =  H'{ti)  and 
where  each  process  perceives  the  same  temporal  relations.  The  global  state  X  may  not  necessarily 
be  an  intermediate  value  from  the  history  that  actually  occurred,  but  it  will  be  an  intermediate  value 
from  an  equivalent^  history. 

What  is  more,  this  is  the  best  we  can  do.  A  consistent  global  state  is  consistent  with  the 
processes’  observations.  Hence  a  snapshot  recording  a  consistent  global  state  is  the  most  accurate 
picture  a  process  can  obtain:  anything  more  accurate  would  require  more  detailed  observations — 
which  would  change  the  computation. 

Even  though  it  may  never  have  occurred,  a  consistent  system  state  still  says  useful  things  about 
the  computation  in  progress.  For  example,  if  property  $  is  stable — it  remains  true  once  it  becomes 
true — then  examining  a  past  consistent  system  state  for  $  may  suffice  to  determine  if  $  holds  at 
the  current  instant. 


Subsequent  Research  Subsequent  research  in  snapshots  explored  variations  on  marker¬ 
pushing  protocols  [SpKe86,  LaYa87,  Ve89,  NeTo90,  Ma93],  characterized  the  state  lattice  that 
arises  from  slices  across  partial  order  time  [Ma89,  Jo89,  JoZw90],  and  modified  the  message  deliv¬ 
ery  model  by  relaxing  the  FIFO  requirement,  and  by  adding  various  flushing  primitives.  Work  also 
progressed  in  developing  applications  of  distributed  snapshots  in  deadlock  detection  [Ma87],  in 
checkpointing  [Jo89,  JoZw90,  Jo93]  and  in  distributed  debugging  [Fi89,  Sp89],  including  efforts 
to  use  timestamp  vectors  to  c^ture  consistent  states  with  specific  properties  [CoMa91,  MaNe91, 
MaSa91,  ToGa93,  GaWa94].  (Taylor’s  work  [Ta89,  CrTa901  uses  a  more  deterministic  notion  of 
snapshot — ^processes  must  know  at  the  time  a  state  occurs  that  that  state  is  part  of  a  snapshot — and 
thus  her  results  do  not  apply  here.) 


Using  Distributed  Time  The  snapshot  problem  demonstrates  that  the  standard  way  of  thinking 
about  computation — as  a  linear  progression  of  system  states — does  not  work  in  an  asynchronous 
distributed  system.  The  unsuitability  of  linear  time  makes  the  snapshot  problem  an  attractive 
demonstration  area  for  the  distributed  time  framework. 

In  Chapter  2  we  phrased  global  states  in  terms  of  timeslices  from  a  computation  graph,  and 
specified  clocks  for  these  temporal  relations.  This  framework  allows  straightforward  snapshot 


^This  phenomenon  is  seductively  similar  to  particie-wave  duality.  Processes  may  construct  a  set  of  possible  paths  for 
the  computation.  Although  the  computation  takes  one  path  in  particular,  processes  can  never  know  which  one:  each 
snapshot  causes  the  set  to  “jump”  to  one  value,  not  necessarily  the  real  one.  Manthey  [MaMo83,  Ma90a,  Ma90bJ  has 
explored  the  use  of  computational  abstractions  to  model  physics,  and  compiled  a  list  of  physical  phenomena  that  arise 
as  side-effects  of  computational  behavior.  The  snapshot  problem  suggests  an  interesting  extension  to  this  research: 
exploring  what  physical  phenomena  may  arise  as  side-effects  of  temporal  behavior. 
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protocols:  processes  use  their  distributed  time  clocks  to  assemble  timeslices.  This  approach  offers 
three  significant  advantages; 

•  Rexibility  Giving  processes  the  ability  to  sort  events  and  states  in  terms  of  the  logical  time 
model  permits  snapshot  protocols  that  are  much  more  flexible  than  the  traditional  marker¬ 
pushing  protocols.  For  example,  we  can  take  multiple  snapshots  and  snapshots  containing 
an  arbitrary  past  event. 

•  Orthogonality  Between  Protocols  and  Clocks  By  encapsulating  the  problems  of 
tracking  time  in  alternative  models  into  clocks  for  these  models,  we  separate  the  implemented 
from  the  implementations.  This  orthogonality  allows  us  to  modify  clock  protocols — ^perhaps 
due  to  changing  system  environments  or  efficiency  goals — without  modifying  the  higher 
level  ^plication  protocols.  For  example,  we  can  transparently  add  security  and  privacy  to 
these  snapshot  protocols  by  using  more  secure  clocks.  (Chapters  5  and  6  consider  these 
issues.) 

•  Orthogonality  Between  Protocols  and  time  Models  We  define  global  states  and 
snapshot  protocols  relative  to  a  time  model.  This  model  encapsulates  the  logical  timing 
issues:  physical  reality  may  determine  some  linear  order,  but  we  pretend  the  model  describes 
what  actually  happens.  However  our  notion  of  what  actually  happens  may  change.  For 
example: 

-  We  may  want  to  abstract  further  than  this  level — ^perhaps  by  pretending  only  global 
states  with  certain  properties  occur. 

-  We  may  want  to  increase  the  separation  between  this  level  and  physical  reality — perhaps 
by  allowing  for  rollback  with  modified  replay. 

Using  the  distributed  time  framework  allows  transparent  alteration  of  the  level  of  abstraction 
in  a  snapshot  protocol  by  using  more  abstract  time  models.  (For  example,  suppose  a  process 
rolls  back  and  performs  different  computation.  At  least  three  virtual  computations  may  arise: 
the  failed  computation,  the  failure-free  virtual  computation,  and  the  recovery  computation 
itself.  The  distributed  time  framework  allows  us  to  use  the  same  protocol  to  take  snapshots 
of  all  three  levels.  Chapter  4  discusses  these  issues  further.) 


Figure  3.1  sketches  this  approach. 


This  Section  Section  3.2  defines  the  snapshot  problem  in  terms  of  distributed  time,  sketches 
a  simple  protocol  to  find  snapshots  containing  any  arbitrary  event  or  state  (even  without  FIFO 
messages),  and  uses  some  of  our  theoretical  results  to  improve  this  basic  protocol.  Section  3.3 
considers  the  implications  of  using  basic  snapshot  protocols  with  more  abstract  time  models,  and 
shows  two  examples.  Section  3.4  explores  some  advanced  issues. 
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Rgura  3.1  Distributed  time  simplifies  protocol  design.  In  the  snapshot  appli¬ 
cation,  we  can  describe  the  target  in  terms  of  distributed  time:  snapshots  are 
timeslices,  instants  of  logical  simultaneity  in  the  temporal  relations  expressed  by  a 
time  model.  Clocks  for  these  time  models  permit  thinking  directly  in  terms  of  these 
relations,  and  thus  provide  the  necessary  primitives  for  snapshot  protocols. 


3.2.  The  Basic  Problem 


This  section  uses  distributed  time  to  examine  the  basic  problem  of  taking  a  snapshot.  Section  3.2. 1 
builds  a  basic  snapshot  protocol.  Section  3.2.2  uses  vector  clocks  and  the  lattice  structure  of 
timeslices  to  simplify  this  protocol. 


3.2.1.  Building  a  Basic  Protocol 

Informally,  we  think  of  a  global  state  as  what  is  happening  everywhere  at  some  moment  in  real 
time.  However,  we  do  not  want  a  description  of  everything  happening  everywhere  (where  would 
we  write  it  all  down?)  but  rather  a  list  of  convenient  abstractions.  The  restrictions  of  distributed 
asynchrony  confine  the  basic  snapshot  protocol  to  capturing  what  is  happening  at  some  monoent  in 
time  in  some  computation  consistent  with  what  processes  observe.  Thus  for  snapshot  applications, 
a  well-constructed  time  model  should  produce  graphs  that  have  two  properties; 

•  the  nodes  express  the  desired  abstractions,  and 

•  the  temporal  precedence  expresses  only  the  observable  orderings. 
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The  PARIIAL-ORDER-TIME  model  built  in  Chapter  2  has  these  properties. 

When  a  process  takes  a  snapshot,  it  wants  to  find  a  global  state  from  a  computation  consistent 
with  what  it  and  the  other  processes  are  observing.  By  Theorem  2.2,  these  consistent  global  states 
are  exactly  the  timeslices  from  the  global  partial  order. 

The  Round  Robin  Protocol  An  interesting  consequence  of  Corollary  2.3  is  that  for  Type  2 
(consistent  and  independent)  parallel  pairs,  any  set  of  mutually  concurrent  nodes — even  a  singleton — 
extends  to  a  consistent  cut.  This  means  that  the  following  naive  protocol  suffices  for  a  process  p 
to  take  a  snapshot.  Let  (M,  M')  be  a  Type  3  (consistent,  independent,  and  strongly  monotonic) 
parallel  pair.  The  protocol  assumes  the  processes  are  organized  into  a  directed  cycle,  and  performs 
the  following  steps: 

1.  Process  Pi  chooses  an  acyclic  node  A\  and  sends  {Ai}  as  a  partial  timeslice  to  Pi. 

2.  For  each  i  with  1  <  i  <  n,  process  P  receives  a  partial  timeslice  from  P,_i  and  appends  a 
local  acyclic  node  mutually  concurrent  with  each  node  in  the  timeslice.  If  i  <  n,  process  P. 
sends  the  new  partial  timeslice  on  to  P,>t.  If  t  =  n,  process  P„  sends  the  completed  snapshot 
back  to  Pi. 

Figure  3.2  presents  a  more  complete  description.  Processes  use  LIST.  CONCURRENT  to  enumerate 
nodes  from  their  own  timelines  (consistent  with  the  knowledge  restriction  from  Section  2.4.3). 
Corollary  2.3  guarantees  that  Ui  will  be  non-empty. 

>^th  modification,  the  Round  Robin  Protocol  extends  to  more  general  parallel  pairs.  For 
example,  if  the  LIST. CONCURRENT  call  were  guaranteed  to  be  answerable,  we  could  relax  the 
(M,  M')  requirement  to  Type  1  (consistent).  If  we  rewrote  the  protocol  to  allow  missing  entries 
from  the  Si  and  to  allow  some  (/,  to  be  empty,  we  could  even  dispense  with  consistency  requirement. 

Unlike  the  traditional  protocol,  the  Round  Robin  Protocol  does  not  require  FIFO  message 
delivery,  is  offline  (in  that  the  initiating  process  may  specify  any  arbitrary  seed  node),  and  allows 
each  process  some  leeway  in  choosing  what  node  to  include  in  the  snapshot. 


3.2.2.  Shortcuts 

The  Round  Robin  Protocol  is  simple  and  clear;  each  process  receives  a  partial  timeslice,  then 
finds  and  appends  a  local  node  mutually  concurrent  with  that  timeslice.  The  protocol  is  also 
fairly  inefficient;  the  multiple  UST.  CONCURRENT  calls  followed  by  the  set  intersection  take  time, 
and  n  rounds  of  communication  must  take  place  before  the  initiating  process  receives  the  desired 
snapshot. 

However,  examining  the  Round  Robin  Protocol  in  terms  of  vector  clocks  reveals  shortcuts  that 
improve  efficiency. 
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/♦  process  Pi  initiates  protocol  */ 
procedure  INITIATE 

/*  find  acyclic  node  to  seed  the  snapshot  */ 

rqieat 

CHOOSE  Ai  ^ 
namACYCUC{Ai,M) 

f*  create  partial  timeslice  Si  and  send  it  off  */ 

Si^{Ai} 

SEND  Si  to  P2 

/*  for  each  i  >  1,  process  Pi  receives  a  set  5,_i  and  cooperates  */■ 
procedure  COOPERATE 

/*find  the  local  nodes  that  are  concurrent  with  each  node  in  Si-i  */ 
for  j  =  1  to  t  —  1 

Aj*-  process  P,  entry  of  Si.i 
Tj*-UST.CONCURRENT{Pi,Aj,  (M,  M')) 

/*find  the  intersection  */ 

Ui*-C\i<j^i  Tj 

/♦  extend  the  partial  timeslice  */ 

CHOOSE  Ai  €  Ui 
Si^Si-i  U  {Ai} 

it  i  <n 

/*  if  incomplete,  send  the  partial  timeslice  to  the  next  process  */ 
then  SEND  Si  to 

/*  if  complete,  send  the  snapshot  back  to  Pi  */ 
else  SEND  Si  to  Pi 


Rgure  3.2  In  the  Round  Robin  Protocol  for  distributed  snapshots,  we  assume  the 
processes  are  organized  into  a  cycle  Pi,...,  P„.  Process  Pi  initiates  the  protocol  by 
choosing  a  local  node  that  is  acyclic,  and  passing  a  partial  timeslice  on  to  Pz.  For 
i  >  \,  process  P  receives  a  partial  timeslice  and  cooperates  by  extending  it,  and 
passing  it  on. 
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The  Reduced  Round  Robin  Protocol  Theorem  2.4  provides  a  way  to  combine  the  timestamp 
vectors  for  partial  timeslices,  and  to  find  nodes  with  which  to  extend  the  partial  timeslice.  This 
result  allows  us  to  reduce  both  the  messages  and  the  computation  in  the  Round  Robin  Protocol.  Let 
(M,  M')  be  a  Type  3  (consistent,  independent,  and  strongly  monotonic)  parallel  pair.  The  protocol 
assumes  the  processes  are  organized  into  a  directed  cycle,  and  performs  the  following  steps: 

1.  Process  Pi  chooses  an  acyclic  node  Ax  and  sends  the  vector  V(  Ai )  to  Pj. 

2.  For  each  i  with  1  <  i  <  n,  process  Pi  receives  a  vector  from  p_i  and  replaces  the  i  entry 
with  the  next  acyclic  node  Ai.  Pi  them  maximiKS  this  vector  against  the  timestamp  vector 
for  Aj.  If*  <  n,  process  P,  sends  the  new  vector  on  to  P,+|.  If*  =  n,  process  Pn  sends  the 
completed  snapshot  back  to  Pi . 

Figure  3.3  presents  a  more  complete  description. 

This  protocol  improves  on  the  Round  Robin  Protocol  by  encoding  the  partial  timeslice  S,  as  the 
first  t  entries  of  vector  Vi,  and  using  the  remaining  entries  of  the  vector  to  mark  the  upper  bound  of 
the  set  of  nodes  preceding  5,  . 

The  Completely  Reduced  Round  Robin  Protocol  Some  convenient  properties  of  vector 
clocks  made  the  Reduced  Robin  Protocol  possible.  As  Chapter  2  explained,  these  properties  have 
a  solid  theoretical  foundation; 

•  Timeslices  form  a  lattice. 

•  The  timestamp  vector  and  rollback  vector  of  a  node  delineate  the  bounds  of  the  sublattice  of 
timeslices  containing  that  node. 

Theorem  2.7  tells  us  that  if  a  process  wants  to  know  a  snapshot  containing  a  node,  then  the  adjusted 
timestamp  vector  of  that  node  suffices.  Thus  we  can  reduce  the  Round  Robin  Protocol  even  further. 
Let  (M,  M')  be  a  Type  3  (consistent,  independent,  and  strongly  monotonic)  parallel  pair.  Our 
completely  reduced  protocol  performs  the  following  step: 

•  For  a  process  p  to  find  a  snapshot  containing  node  A  at  q,  process  p  independently  ^sks  each 
process  r  ^  9  for  the  value  NEXT{r,  7rrV(  A),  (M,  M')). 

This  protocol  dispenses  with  the  assumption  that  processes  are  organized  into  a  cycle. 

In  general,  concurrence  is  not  transitive — to  find  the  next  element  of  a  partial  timeslice,  a 
process  must  check  every  element.  But  snapshots  from  the  adjusted  vectors  have  the  advantage  of 
being  canonical,  in  the  sense  that  the  identity  of  the  vector  (e.g.,  “the  adjusted  timestamp  vector  of 
A")  is  sufficient  to  determine  membership.  The  initiating  process  still  must  query  other  processes, 
but  these  queries  may  now  be  independent  rather  than  sequential. 
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/*  process  P\  initiates  protocol  */ 
procedure  INITIATE 

/*find  acyclic  node  to  seed  the  snapshot  */ 

repeat 

CHOOSE  >li  #_1 
until  ACYCUC{Ai,M) 

/*  send  off  a  vector  */ 

SEND  Vi  to  P2 

/*  for  each  z  >  1,  process  Pi  receives  a  vector  Vi_i  and  cooperates  */ 
procedure  COOPERATE 

/*  advance  Pi  entry  to  next  acyclic  node  */ 

7rp.Vi_i 

repeat 

Ai^NEXT{Pi,  (M,  M')) 
until  ACYCUC{AiM) 

/*  obtain  new  vector  */ 

Vi^MAX{Vi.uy{A),iMM)) 

if  i  <  n 

/*  if  incomplete,  send  the  new  vector  to  next  process  */ 
then  SEND  K  to 

/*  if  complete,  send  the  snapshot  back  to  Pi  V 
else  SEND  V,  to  Pi 


Figure  3.3  In  the  Reduced  Round  Robin  Protocol  for  distributed  snapshots,  we 
assume  the  processes  are  organized  into  a  cycle  Fi,...,Pn-  Process  Pi  initiates 
the  protocol  by  choosing  a  local  node  that  is  acyclic,  and  its  timestamp  vector  to 
P2.  For  each  z  >  1,  process  P,  receives  a  vector  whose  first  i  -  1  entries  comprise 
a  partial  timeslice,  and  whose  remaining  entries  are  the  maximal  nodes  preceding 
this  partial  timeslice.  Process  P,  cooperates  by  extending  the  partial  timeslice, 
updating  its  vector  encoding,  and  passing  it  on.  The  vector  maximization  step 
cannot  affect  the  first  i  entries,  since  these  form  a  partial  timeslice. 
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The  most  straightforward  way  to  assemble  an  adjusted  vector  takes  2  log  n  communication 
steps — use  a  binary  tree  to  send  out  the  requests,  and  another  to  collect  them.  However,  several 
natural  optimizations  suggest  themselves.  For  example,  if  A  is  sufficiently  far  in  the  past,  then 
process  p  may  already  know  some  values.  Also,  various  processes  may  know  the  components  for 
other  processes. 

Theorem  2.7  also  gives  us  ways  to  collect  sn^shots  without  actually  having  access  to  clocks. 
For  example,  process  p  could  obtain  V*(A)  (the  adjusted  timestamp  vector  of  A)  by  searching 
backwards  along  the  paths  of  messages  arriving  before  A.  In  fact,  the  traditional  snapshot  protocol 
uses  essentially  this  technique:  initiate  broadcast  at  node  B,  and  obtain  R*  ( .6 )  (the  adjusted  rollback 
vector  of  B). 


3.3.  Snapshots  from  Higher-level  Models 


Both  the  Round  Robin  variants  and  the  traditional  snapshot  protocol  obtain  a  single  snapshot.  A 
single  snapshot  is  satisfactory  from  a  linear  view:  the  question  of  “what  is  happening  right  now” 
should  only  have  a  single  answer.  But  distributed  asynchrony  gives  multiple  correct  answers.  This 
fact  made  Chandy  and  Lamport’s  work  appear  counter-intuitive,  and  helps  motivate  distributed 
time  theory. 

Many  global  states  may  contain  some  specific  node  from  process  p.  However,  process  p 
may  need  access  to  states  different  from  the  one  that  particular  run  of  a  single-snapshot  protocol 
provides.  Process  p  has  only  a  couple  of  approaches: 


•  The  desired  states  may  be  exactly  the  timeslices  in  some  higher-level  time  model  supported 
by  clock  primitives.  In  this  case,  taking  a  snapshot  in  this  alternate  model  suffices. 


•  Otherwise,  process  p  needs  either  to  extend  its  snapshot  protocol  to  handle  search-and- 
backtrack,  or  to  find  an  efficient  way  to  collect  sets  of  global  states  (so  it  can  perform  the 
search  locally). 


This  section  focuses  on  this  general  snapshot  problem  (finding  a  global  state  satisfying  some 
arbitrary  predicate  $)  by  taking  snapshots  from  higher-level  models.  Section  3.3. 1  outlines  an 
easy  example:  when  snapshots  satisfying  $  are  exactly  the  timeslices  from  a  higher-level  model. 
Section  3.3.2  presents  a  more  complicated  example:  capturing  a  large  class  of  snapshots  by  cap¬ 
turing  a  single  snapshot  from  a  higher-level  model. 
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3.3.1 .  The  Easier  Case 


Section  2.3.1  showed  that  if  a  time  model  is  doing  its  Job,  then  its  timeslices  correspond  exactly 
to  the  global  states  in  the  underlying  computations.  This  correspondence  assures  a  process  that  by 
obtaining  one  of  these  timeslices,  it  is  taking  a  snapshot  of  a  global  state. 

The  Special  Timeslice  Condition  Suppose  that  a  process  wants  snapshots  of  global  states 
that  additionally  satisfy  some  arbitrary  predicate  In  some  sense,  the  process  wants  to  pretend 
the  only  moments  of  simultaneity  that  exist  are  ones  that  satisfy  It  would  be  convenient  if  a  time 
model  would  express  this  pretense  by  admitting  only  the  interesting  global  states  as  timeslices.  To 
this  end,  we  revise  the  earlier  Timeslice  Condition. 

Suppose  time  model  M  on  ground-level  graphs  generates  the  set  Q,  and  predicate  $  on  specifies 
which  global  states  are  of  interest.  Model  M  satisfies  the  Special  Timeslice  Condition  for  $  iff  for 
each  /9  €  ^,  M  satisfies  these  requirements: 

1.  For  each  set  X  of  nodes  in  /5,  the  following  are  equivalent  statements: 

•  X  minimally  represents  a  global  state  Y  satisfying  0  in  some  ground-level  graph  a 
with  M(a)  =  /3. 

•  X  is  a  timeslice  in  0. 

2.  Each  ground-level  graph  a  with  0  =  M(a)  and  each  global  state  K  in  q  with  4>(y')  satisfy 
the  statement: 

•  if  ( M,  a )  ( A)  n  y  ^  <2)  for  some  node  A,  then  some  timeslice  in  0  minimally  represents 
K. 

We  build  time  models  by  specifying  nodes  and  precedence;  timeslices  follow  from  this  basic 
construction.  Hence  the  arbitrary  predi  ^te  $  must  possess  a  fair  degree  of  structure  (and  the  model 
builder  a  fair  degree  of  insight)  in  order  to  lead  to  a  time  model  satisfying  the  Special  Timeslice 
Condition.  For  this  iq}proach  to  work,  the  predicate  must  decompose  to  incomparability  under 
some  nicely  behaved  precedence  relation. 

Example:  No  Messages  In  Transit  As  an  exarnple,  suppose  a  process  wanted  only  global 
states  where  no  mess£^es  are  in  transit.  No  send  event  preceding  a  snapshot  X  has  its  receive  event 
following  X.  Whether  send  and  receive  events  themselves  ought  to  be  permitted  to  be  members  of 
such  snapshots  is  another  issue;  if  process  p  is  in  the  very  act  of  sending  or  receiving  a  message, 
is  that  message  in  transit  or  not?  We  choose  the  cleanest  approach:  we  permit  the  node  before  the 
send  or  after  the  receive,  but  not  the  transitional  events  themselves. 

In  the  PARTIAL- ORDER.  TIME  model,  a  consistent  cut  X  with  this  no-transit  property  is  genuinely 
a  cut  of  the  partial -ORDER -TIME  graph:  any  path  from  ±  to  T  must  touch  a  node  in  .V.  (Other 
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consistent  cuts  will  not  partition  the  graph,  since  message  edges  connect  the  past  of  the  cut  to  the 
future  of  the  cut.) 

If  process  p  never  wants  to  see  a  send  event  having  occurred  without  also  seeing  the  correspond¬ 
ing  receive  event,  it  essentially  wants  to  pretend  that  corresponding  send  and  receive  events  occur 
simultaneously.  We  formalize  this  pretense  by  defining  the  strong  model,  that  adds  an  edge  from 
each  receive  event  to  its  send  event.  The  strong  model  also  must  handle  the  case  of  unreceived 
messages.  We  take  the  approach  of  adding  an  edge  from  T  to  the  send  event  of  each  unreceived 
message.^  We  then  merge  the  atoms  cyclic  with  T  into  T.  We  define  the  strong  .  partial  .  order 
model  to  be  the  composition: 

strong  _  PARTIAL  .  ORDER  =  STRONG  o  PARTIAL  _  ORDER  .TIME 

The  new  STRONG  .PARTIAL  .ORDER  model  exhibits  much  of  the  same  theoretical  structure  as 
its  original  version.  For  example,  (strong,  partial  .order,  timelines)  is  still  a  consistent  and 
independent  parallel  pair,  so  Theorem  2.1,  Corollary  2.3,  Theorem  2.4,  Theorem  2.5,  Theorem  2.6, 
and  Theorem  2.7  all  still  hold. 

Conveniently,  the  STRONG  model  prohibits  send  and  receive  events  from  strong  timeslices, 
since  these  events  are  cyclic.  This  model  thus  obtains  the  desired  result: 

Theorem  3.1  The  strong  .partial,  order  model  satisfies  the  Special  Timeslice 
Condition  for  global  states  with  no  messages  in  transit. 

Proof  Suppose  message  M  from  p  to  ?  is  sent  at  S  and  received  at  i?.  In  STRONG .  partial  .  ORDER, 
any  node  preceding  R  precedes  any  node  following  5.  If  message  M  is  sent  by  process  p  at  5  but 
is  unreceived,  then  all  nodes  following  S  in  PAKITAL. ORDER. TIME  are  cyclic,  and  prohibited  from 
timeslices.  □ 

Cycles  of  nodes  become  atomic  units — if  timeslices  define  perceivable  moments,  then  cyclic 
sets  can  never  be  perceived  as  only  partially  complete.  This  observation  suggests  that  transaction 
behavior  may  fit  nicely  into  the  framework  of  distributed  time.  (Section  7.2  discusses  this  topic 
further.) 

Suppose  the  clock  primitives  for  the  global  order  extend  to  the  STRONG  version.  A  process 
may  obtain  a  snapshot  guaranteed  to  be  a  global  state  with  neither  send  nor  receive,  nor  message 
in  transit,  simply  by  carrying  out  a  partial  order  time  snapshot  protocol,  modified  by  substituting 
clock  primitives  for  the  new  model. 

Unfortunately,  the  cyclic  STRONG  .PARTIAL.  ORDER  model  no  longer  satisfies  the  convenient 
information  properties  of  Section  2.4.3.  For  example,  suppose  process  p  executes  a  send  event  5 


^According  to  this  fix,  all  messages  are  received,  only  some  messages  are  not  received  until  the  end  of  time. 
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of  a  message  that  is  later  received  by  process  q  in  receive  event  R.  Process  p  at  5  may  know  that 
R  — ►  S  in  STRONG -PAKriAL. ORDER — however,  process  p  may  not  know  anything  else  about  R. 

More  precisely,  (strong .partial -ORDER,  timelines  )  is  neither  strongly  monotonic  nor  flow- 
supported  (although  it  at  least  offers  the  advantage  that  processes  can  know  that  unmatched  send 
events  exist,  so  at  least  convergence  can  actually  be  determined).  Specifying  and  implementing 
clocks  for  such  cyclic  models  becomes  rather  tricky.  In  our  original  scheme,  the  PRECEDES 
primitive  has  two  answers.  However,  the  query  “does  A  precede  B  in  STRONG  _  partial  _  ORDER?” 
admits  a  third  answer;  “not  enough  information  yet.” 

The  proper  answer  to  the  new  query  depends  on  when  it  is  asked.  Consequently,  we  need 
a  third  time  model  to  express  the  degree  of  information  flow  for  clock  primitives.^  This  model 
should  fit  between  the  original  partial  order  and  its  composition  with  strong;  the  abstraction 
hierarchy  framework  from  distributed  time  theory  provides  the  necessary  machinery.  This  concept 
of  parameterized  clocks  also  extends  to  handle  the  case  when  various  circumstances  (such  as  faults 
or  malice)  prevent  the  convenient  information  assumptions  in  well-behaved  partial  order  models 
from  holding. 


3.3.2.  A  Harder  Case 

Theorem  2.7  tells  us  that  the  adjusted  timestamp  vector  of  a  node  is  a  timeslice.  One  might  conjec¬ 
ture  that  any  nontrivial  timeslice  is  the  adjusted  vector  of  one  of  its  nodes.  In  fact,  this  conjecture 
is  false — Figure  3.4  shows  a  counter-example.  However,  we  can  still  establish  something  rather 
interesting:  that  adjusted  vectors  uniquely  describe  any  nontrivial  timeslice,  and  that  we  can  obtain 
these  descriptions  by  taking  timeslices  from  higher-level  time  models. 

For  the  snapshot  problem,  these  results  have  two  implications: 

•  A  process  might  obtain  snapshot  X,  but  determine  that  a  different  one  is  necessary.  The 
descriptions  give  a  way  of  quickly  specifying  and  obtaining  a  new  one. 

•  A  process  can  obtain  a  group  of  related  snapshots  by  taking  a  single  snapshot  in  the  higher- 
level  model. 


(Charron-Bost  (CB89]  establishes  a  related  result:  in  partial  orders,  a  bijection  exists  between 
antichains  (i.e.,  partial  timeslices)  and  past-closed  graph  prefixes.  Besides  being  developed  in 
a  different  framework,  our  results  are  distinct  because  past-closed  graph  prefixes  do  not  map 
injectively  to  timeslices.) 


^Section  6.2  explores  these  issues  further. 
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Rgura  3.4  Not  all  timeslices  are  adjusted  vectors.  Timeslice  X  =  {A,  B)  equals 
neither  V*(/l)  nor  V(B).  This  example  disproves  the  conjecture  that  Theorem  2.7 
might  describe  all  timeslices. 


The  Description  Let  (M,  be  a  IVpe  2  parallel  pair  (consistent  and  independent)  that  is 
also  acyclic.  Let  K  be  a  non-trivial  partial  titneslice:  ..  non-empty  set  of  non-extremal  nodes  that 
are  mutually  concurrent.  Let  X  be  2l  non-trivial  timeslice  from  M.  A  generating  subset  of  X  is  a 
Y  C  X  satisfying  the  equation; 


X  =  UV‘(A) 

A€Y 

A  minimal  generating  subset  of  X  is  a  generating  subset  of  X,  with  the  additional  property  that  no 
proper  subset  is  a  generating  subset  of  X. 

The  remainder  of  this  section  establishes  two  key  results: 


1.  Each  timeslice  X  has  a  unique  minimal  generating  subset. 

2.  A  set  K  is  a  minimal  generating  subset  of  a  timeslice  iff  V  is  a  non-trivial  partial  timeslice 
in  a  higher-level  model. 


The  Blocking  Model  Establishing  these  results  hinges  on  drawing  edges  from  a  node  A  to 
node  B  when,  in  the  transitive  global  order,  the  local  predecessor  of  A  precedes  B.  We  can  express 
this  enhancement  itself  as  a  model,  blocked.  The  blocked  model  operates  on  a  graph  by  copying 
it,  and  for  each  cross-process  edge  from  node  A  to  node  B,  adding  another  edge  the  local  successor 
of  A  to  B.  The  name  of  the  blocked  model  derives  from  its  function  (which  we  will  demonstrate 
shortly).  If  A  — >  B  in  blocked  o  M,  then  the  presence  of  B  in  an  M  timeslice  X  blocks  node  A 
from  being  part  of  the  minimal  generating  subset  for  X. 
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Results  When  an  acyclic  parallel  pair  (M,M')  is  Type  2  (that  is,  consistent  and  indepen¬ 
dent),  then  (blocked  o  M,  blocked  q  M')  is  a  parallel  pair  that  is  still  independent,  transitively- 
bounded,  and  acyclic.  However,  this  new  parallel  pair  may  not  necessarily  be  view-complete. 
Consider  a  graph  from  the  partial -Order -TIME  model.  Suppose  at  process  p,  only  a  single  state 
node  A  separates  a  send  node  S  from  a  subsequent  message  node  B.  The  BLOCKED  model  will 
slide  the  S  — >  R  message  edge  up  to  A  — *■  R,  and  the  edge  from  A  to  B  may  not  necessarily 
have  an  externally  equivalent  node. 

Precedence  in  blocked  expresses  when  the  presence  of  one  node  in  a  timeslice  forces  the 
presence  of  another. 

Theorem  3.2  Suppose  (M,  M')  is  an  acyclic  Type  2  (consistent  and  independent) 
parallel  pair.  If  non-extremal  distinct  A  and  B  satisfy  A  ♦-/->  B  in  M,  then 

A — ^  B  in  BLOCKED  o  M  AeV(5) 

Proof  Let  A'  be  the  node  immediately  preceding  A. 

Suppose  A  — y  B  in  BLOCKED  o  M.  Since  A  -/-+  B  in  M,  the  precedence  path  from  Ato  B 
must  have  been  created  by  BLOCKED  moving  ahead  the  in-node  of  a  message  edge.  This  could  only 
have  made  a  difference  when  A'  — ►  B  in  M.  Thus  A'  6  V(B),  hence  A  €  V*(5). 

Conversely,  suppose  A  €  V*(B).  Then  in  M,  A!  — y  B  but  A  -f~y  B.  Thus  A'  must  be  the 
send  event  of  a  message  that  is  received,  so  blocked  will  copy  and  shift  this  message  edge,  giving 
A  — >  B  in  BLOCKED  0  M,  □ 

To  obtain  our  main  results,  we  first  establish  a  condition  on  generating  subsets. 

Theorem  3.3  Suppose  (M,  M')  is  an  acyclic  Type  2  (consistent  and  independent) 
parallel  pair.  Let  X  be  a  non-trivial  timeslice  from  M.  Any  generating  subset  Y  of  X 
satisfies  the  statement: 

VAeX  3B  a  B  in  BLOCKED  o  M 

Proof  Suppose  the  condition  fails  for  A  €  X.  Then  A  ^Y,  and  (from  Theorem  3.2),  for  no 
C  G  y  is  A  €  V’(C).  Thus  A  is  not  in  the  join  of  the  adjusted  timestamp  vectors  of  V',  so  Y  is  not 
a  generating  subset.  □ 


We  then  construct  a  generating  subset  for  a  given  timeslice. 

Theorem  3.4  Suppose  (M,  M')  is  an  acyclic  Type  2  (consistent  and  independent) 
parallel  pair.  Let  X  be  a  non-trivial  timeslice  from  M.  Let  Y  be  the  set  of  BLOCKED  o  M 
sinks  in  X. 


Y  =  {A€X:V5eX,  A  -f-y  B  in  BLOCKED  o  M} 
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Then  K  is  a  generating  subset  of  X. 


Proof  Let  Z  be  the  join  of  the  M  adjusted  timestamp  vectors  of  nodes  in  Y .  For  each  process  p, 
let  Aj,  and  Bp  be  the  p  entries  of  X  and  Z,  respectively.  We  establish  that  Ap  =  Bp  by  considering 
the  two  cases: 

1.  Suppose  Ap  G  Y.  Let  C  be  another  node  in  F,  let  Cp  be  the  p  entry  in  its  M  adjusted 
timestamp  vector.  If  Ap  =  Cp  then  (by  Theorem  3.2),  Ap  — ►  C  in  blocked  o’M,  violating 
the  construction  of  Y.  If  Ap  — >  Cp  in  M  but  Ap  ^  Cp,  then  Ap  — >  C  in  M,  violating  the 
fact  that  X  is  a  timeslice.  Thus  Ap  properly  follows  from  each  Cp,  so  Ap  =  Bp. 

2.  Suppose  Ap  ^  Y.  By  construction  of  Y,  there  exists  a  C  €  F  such  that  Ap  — >  C  in 
BLOCKED  o  M.  Since  Ap  and  C  are  both  members  of  timeslice  X,  Theorem  3.2  gives  that 
Ap  €  V*(C).  If  Z)  ^  C  in  F  has  W’iD)  dominating  V’(C)  at  p,  then  Ap  precedes  or  equals 
the  p  entry  of  V(Z?).  Thus  Ap  — >  D  in  M,  violating  the  fact  that  ^  is  a  timeslice.  Thus 
Ap  =  Bp. 

□ 


We  use  the  condition  from  Theorem  3.3  to  show  that  the  generating  subset  from  Theorem  3.4 
is  unique  and  minimal. 

Theorem  3.5  Suppose  (M,  M')  is  an  acyclic  Type  2  (consistent  and  independent) 
parallel  pair.  Any  non-trivial  timeslice  from  M  has  a  unique  minimal  generating 
subset. 

Proof  Let  X  be  a  non-trivial  timeslice  from  M.  Let  F  be  the  generating  subset  from  Theorem  3.4. 
Theorem  3.3  provides  two  facts: 

1.  Any  generating  subset  F'  of  X  has  F  C  F'. 

2.  Any  proper  subset  of  F  cannot  be  a  generating  subset  of  X. 

Hence  F  is  the  unique  minimal  generating  subset.  Q 

Finally,  we  show  that  being  a  unique  minimal  generating  subset  is  equivalent  to  being  a  partial 
timeslices  under  BLOCKED. 

Theorem  3.6  Suppose  (M,  M')  is  an  acyclic  Type  2  (consistent  and  independent) 
parallel  pair.  Let  F  be  a^t  of  non-extremal  nodes.  F  is  the  unique  minimal  generating 
subset  of  a  timeslice  in  M  iff  F  is  a  partial  timeslice  in  blocked  o  M. 
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Proof  Apply  Theorem  3.5.  The  unique  minimal  generating  subset  of  a  timeslice  X  is  partial 
timeslice  in  BLtXKED  o  M.  Any  not-triviai  partial  timeslice  Y  from  BLOCKED  o  M  is  the  unique 
minimal  generating  set  of 

X  =  UV'(A) 

A^Y 

□ 


Implications  Suppose  (M,  M')  is  an  acyclic  Typ.  ^lent  and  independent)  parallel  pair. 

A  timeslice  containing  k  nodes  in  blocked  o  M  is  a  shorthand  representation  of  2^  timeslices 
in  M.  If  the  clock  primitives  extend  to  handle  queries  about  blocked  q  M,  then  process  p  can 
capture  a  large  class  of  snapshots  in  M  by  asking  for  a  single  one  in  blocked  o  M.  Naturally,  we 
can  establish  dual  results  for  adjusted  rollback  vectors. 


3.4.  Further  Issues 

To  find  out  the  state  of  the  system,  a  process  takes  a  snapshot.  The  fact  that  the  snapshot  does  not 
necessarily  describe  a  real  state  creates  some  complications.  Section  3.4.1  considers  the  problem 
of  resolving  the  parallax  between  inconsistent  states,  and  Section  3.4.2  considers  some  areas  for 
future  work. 


3.4.1.  Resolving  Parallax 

The  fact  that  standard  snapshot  protocols  only  guarantee  consistent  global  states  allows  some 
potentially  difficult  parallax  situations:  two  snapshots  taken  in  the  same  computation  may  not 
mesh.  Process  p  may  take  snapshot  X‘,  process  q  may  take  snapshot  Y.  The  global  states  X  and  Y 
will  each  be  genuine  global  states  in  some  physical  computation  underlying  the  unfolding  partial 
order — but  they  might  not  both  appear  in  the  same  physical  computation.  For  example,  perhaps 
there  exists  a  pair  of  nodes  such  that  at  instant  X,  one  lies  in  the  future  and  one  in  the  past,  but  at 
instant  Y  the  positions  are  reversed.  Figure  3.5  sketches  a  simple  example. 

Distributed  time  theory  gives  us  a  way  to  understand  parallax;  distributed  time  primitives  form 
the  basis  for  a  simple  protocol  to  resolve  parallax. 

Why  Does  Parallax  Occur?  Timeslices  in  a  computation  graph  form  a  sublattice  of  a  divided 
hypercube  in  k  dimensions,  where  k  is  the  number  of  messages.  We  obtain  this  hypercube  by 
drawing  nodes  at  each  point  with  coordinates  from  the  set  {0, 1,2},  and  drawing  an  edge  from 
node  Ni  to  node  N2  when  they  differ  in  exactly  one  coordinate,  and  the  N2  value  there  is  one 


Rgure  3.5  Parallax  occurs  when  snapshots  appear  to  be  inconsistent.  Consider 
the  positions  of  nodes  A  and  B  with  respect  to  timeslices  X  =  V‘(A')  and 
Y  =  Both  timeslices  represent  logically  simultaneous  instants.  However, 

at  instant  X,  A  has  occurred  but  B  has  not,  while  at  instant  Y,  B  has  occurred  but 
A  has  not. 


greater  than  the  N\  value.  Each  dimension  represents  a  message  M,  and  the  coordinate  values 
represent  that  message’s  status:  unsent,  en  route,  or  received. 

Under  the  simplifying  assumption  that  no  two  message  events  occur  simultaneously,  each 
source-sink  traversal  of  this  gr^h  corresponds  to  a  real  time  trace  of  a  computation  generating  this 
grs^h. 

Timeslice  precedence  captures  computation  paths.  Any  timeslice  follows  or  equals  the  minimal 
timeslice  and  precedes  or  equals  the  final;  timeslices  X  and  Y  satisfy  X  -<  K  when  a  computation 
path  exists  from  X  to  Y.  In  general,  the  timeslice  lattice  is  not  a  straightline  graph;  timeslice 
precedence  is  a  not  a  total  order.  Parallax  follows  from  the  existence  of  timeslices  that  are  “con¬ 
current,”  in  the  sense  that  they  are  incomparable  under  the  -<  order.  (This  structure  on  timeslices 
is  reminiscent  of  the  time  model  structui::  on  nodes.) 


Resolving  the  Inconsistency  Once  again,  lattices  come  to  the  rescue.  Suppose  X  and  Y 
are  timeslices  in  a  view-complete,  transitively  bounded  graph.  We  already  know  that  X  and  Y  are 
consistent  cuts.  We  can  directly  establish  that  the  set  of  all  timeslices  Z  such  that  Z  C  X  U  T 
forms  a  finite  lattice:  this  set  is  closed  under  (1  and  U  . 

To  simplify  presentation,  we  assume  without  loss  of  generality  that  X  properly  dominates  Y 
at  each  process.  (In  the  general  case,  we  would  take  the  join  and  meet  of  the  two  timeslices,  and 
restrict  our  attention  the  processes  where  the  values  differ.) 

Clock  primitives  provide  a  basis  for  constructing  a  simple  graph  that  expresses  the  sublattice 
of  timeslices  contained  in  the  set  X  U  y.  Construct  the  graph  G  by  creating  one  node  for  each 
process,  and  drawing  an  edge  from  process  p  to  process  q  ^  p  when  Kp  X  — ►  x,  K .  If  there  are  n 
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processes,  this  construction  takes  O(n^)  operations;  access  to  vector  operations  and  vector  clocks 
can  bring  this  down  to  linear. 

The  graph  G  concisely  expresses  the  sublattice.  To  obtain  a  timeslice  Z,  a  process  follows 
these  steps: 

1 .  Choose  a  node  p  in  G. 

2.  Select  either  Xp  X  or  VpY  for  Z. 

(a)  If  Xp  X,  then  for  each  node  q  that  (transitively)  follows  p,  select  x,  X. 

(b)  If  Xp  Y,  then  for  each  node  q  that  (transitively)  precedes  p,  select  x,  Y. 

Delete  the  nodes  for  which  we  just  selected  values. 

3.  Repeat  until  all  entries  of  Z  are  chosen. 

3.4,2.  Future  Work 

Fully  generalizing  the  protocols  and  primitives  requires  confronting  unresolved  obstacles — and 
also  suggests  interesting  new  structures.  These  issues  provide  topics  for  further  research. 


Convergence  and  Knowability  The  snapshot  protocols  of  Section  3.2  implicitly  assume  that 
a  given  process  has  sufficient  information  to  decide  clock  queries.  As  Section  2.4.3  discusses,  this 
implicit  assumption  may  fail.  We  present  three  such  scenarios: 


•  cyclic  models; 

•  clock  implementation  where  faults  or  malice  or  efficiency  prevent  complete  knowledge;  and 

•  snapshot  queries  about  recent  nodes. 


These  scenarios  require  more  active  consideration  oi  convergence:  when  information  about  the 
past  of  a  node  catches  up  with  the  future  of  a  node.  These  scenarios  also  require  a  more  detailed 
exploration  of  the  knowability  issues  of  Section  2.4.3. 


Observation  Effects  Consider  again  a  snapshot  protocol  that  assumes  all  processes  have  heard 
of  node  A.  If  a  precedence  path  does  not  exist  from  node  A  to  process  q,  should  the  arrival  of  the 
snapshot  query  establish  one?  That  is,  how  should  the  act  of  examining  the  computation  interact 
with  the  computation  itself? 
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Suppose  a  process  p  uses  a  snapshot  protocol  determines  if  a  global  state  satisfying  a  particular 
property  exists.  Process  p  obtains  its  result  at  node  A.  What  should  happen  to  process  p  if  this 
query  fails — but  later  another  process  rolls  back  and  changes  history?  If  a  suitable  timeslice  now 
exists,  the  answer  process  p  received  is  incorrect,  so  node  A  now  depends  on  incorrect  data.  The 
need  to  express  this  dependence  suggests  that,  for  time  models  expressing  perception,  every  node 
examined  in  a  snapshot  search  should  precede  the  response  node.  Is  this  synchronization  desirable? 
Should  we  use  the  abstraction  hierarchy  techniques  of  distributed  time  to  capture  this  influence  in  a 
higher-level  model?  What  should  happen  in  snapshot  protocols  where  the  existence  of  a  sn^shot 
does  not  change,  but  the  actual  snapshot  returned  does? 

Global  States  and  Guaranteed  Pasts  The  abstract  computation  graph  describing  a  com¬ 
putation  induces  a  lattice  of  timeslices.  The  actual  physical  computation  that  occurs  determines 
which  path  through  this  lattice  the  system  actually  takes.  The  limitations  of  distributed  asynchrony 
prevent  the  system  from  ever  finding  out  which  path  this  is;  as  a  consequence,  taking  a  snapshot  to 
determine  some  property  of  the  computational  path  is  difficult  in  the  general  case. 

Recent  work  in  global  state  detection  (e.g.,  [CoMa91,  MaNe91,  MaSa91,  ToGa93,  GaWa94)) 
has  explored  this  area,  both  by  explictly  searching  the  timeslice  lattice  and  by  defining  classes  of 
predicates  (in  particular,  by  relaxing  the  stability  requirement)  that  may  be  examined  by  snapshot 
techniques,  integrating  this  work  into  our  framework  would  be  an  interesting  area  for  future 
research. 

Anotlier  research  direction  comes  from  the  observation  that  although  a  process  p  cannot  find  a 
recent  gl  jbal  state  guaranteed  to  have  occurred,  it  may  have  some  use  for  a  concise  set  { .Yi , . . . ,  .V* } 
of  recen  t  global  states,  such  that  one  of  them  definitely  occurred.  In  the  general  case,  these  sets 
will  be  minimal  graph-cuts  of  the  timeslice  lattice.  With  some  simplifying  assumptions,  these  sets 
have  a  nore  familiar  form:  a  “maximal  set  of  X,  such  that  no  Xi  -<  X," — that  is,  a  timeslice  of 
timeslices.  Can  we  direc  ly  generalize  distributed  time  to  build  higher-level  time  models  whose 
nodes  represent  timeslices  in  lower-level  time  models? 
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Chapter  4 

Optimistic  Roiiback  Recovery 


4.1.  Overview 

4.1 .1 .  The  Basic  Problem 

The  problem  of  rollback  arises  when  a  process  p  in  a  distributed  computation  rolls  back'  to  a 
previous  state.  This  problem  typically  appears  when  providing  fault  tolerance  for  distributed 
computation:  physical  failure  of  process  p  might  force  p  to  roll  back,  since  the  most  recent  state  of 
p  that  can  be  recreated  may  not  necessarily  be  the  state  p  held  when  it  failed.  Some  applications 
might  also  permit  rollback  in  non-failure  scenarios;  for  example,  process  p  might  roll  back  its 
computation  if  p  discovers  that  critical  input  data  was  corrupted.  For  clarity  of  presentation,  this 
chapter  assumes  that  rollback  occurs  because  of  process  failure-,  however,  our  techniques  also  apply 
to  the  more  general  case. 


Failure  and  Recovery  Suppose  process  p  fails  and  recovers  by  restarting  itself  from  an  earlier, 
saved  state.  All  activity  by  process  p  since  it  first  passed  through  this  restored  state  has  been  lost. 
(Figure  4. 1  illustrates  this  scenario.)  If  the  original  execution  of  this  lost  activity  affected  no  process 
other  than  p,  then  the  loss  of  this  activity  can  affect  no  process  other  than  p.  For  example,  suppose 
the  lost  activity  had  been  entirely  internal  to  p,  or  had  included  only  the  receipt  of  messages  (if 
messages  are  not  acknowledged  and  could  also  be  lost  for  other  reasons).  In  this  case,  the  rest  of 
the  system  may  proceed  without  ever  knowing  about  process  p’s  failure  and  recovery. 


Dependence  However,  suppose  the  lost  activity  at  process  p  included  the  send  event  of  a 
message  that  was  received  by  process  q.  Then  the  state  of  process  q  depends  on  activity  at  process 
p  that  led  up  to  the  sending  of  that  message,  but  some  of  this  activity — including  the  send  event 
itself — has  been  rolled  back  due  to  the  failure.  Process  q  has  received  a  message  that,  in  process 
p’s  view  after  recovery,  was  never  sent.  The  computation  at  process  q  after  this  receive  event  must 


“Rollback”  is  the  noun;  “roll  back”  is  the  verb  phrase. 
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also  be  rolled  back  in  order  to  restore  the  system  to  a  consistent  state.  (Figure  4.2  illustrates  this 
scenario.) 


Transitive  Dependence  Distribution  and  asynchrony  may  make  the  situation  even  more  com¬ 
plicated.  For  example,  suppose  process  q  receives  a  message  from  process  p,  but  p  then  rolls  back 
its  send  event.  However,  before  learning  of  process  p’s  rollback,  process  q  sends  a  message  to 
another  process  r.  Then  process  r  depends  on  computation  that  has  been  roiled  back — even  though 
process  r  may  not  have  directly  received  a  message  from  p.  (Figure  4.3  illustrates  this  scenario.) 


Orphans  In  rollback  recovery,  an  orphan  is  an  event  or  state^  that  causally  depends  on  (or  equals) 
computation  that  has  been  rolled  back.  This  terminology  and  the  use  of  partial  .order  .time  to 
express  potential  causality  simplifies  the  above  discussion.  In  Figure  4.1,  nodes  A2  through  A$  are 
orphans,  since  they  depend  on  prior  computation  that  has  been  rolled  back.  Figures  4.2  and  4.3 
show  orphans  at  other  processes:  the  rollback  in  Figure  4.2  causes  nodes  and  B4  to  become 
orphans;  the  rollback  in  Figure  4.3  causes  nodes  C3  and  C4  to  become  orphans,  along  with  nodes 
B3  through  B^. 


4.1.2.  Further  Issues 

Delayed  Messages  Rollback  can  also  give  rise  to  some  pathological  scenarios.  For  example, 
the  lost  activity  at  process  p  may  include  the  send  event  of  a  message  to  process  q  that,  due  to 
network  delays,  does  not  arrive  at  q  until  after  p  has  rolled  back  and  the  system  appears  to  have 
recovered.  Accepting  this  message  may  cause  process  q  to  become  an  orphan,  as  Figure  4.4  shows. 
However,  addressing  this  problem  by  blindly  discarding  all  messages  sent  before  rollback  may 
lead  to  discarding  valid  messages,  as  Figure  4.5  shows. 


failure 

p: 


Figure  4.1  The  problem  of  rollback  arises  when  a  process  fails  and  restarts  from 
an  earlier  state.  Here,  process  p  fails;  it  recovers  at  state  by  restarting  from  the 
state  from  Ai,  A  large  “X”  marks  each  node  that  has  been  rolled  back. 


^The  literature  sometimes  extends  the  notion  of  orphan  to  include  processes  (whose  current  state  is  an  orphan)  and 
messages  (whose  send  events  are  orphans). 
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Figure  4.2  The  problem  of  rollback  becomes  complicated  when  another  process 
depends  on  events  and  states  that  the  failed  process  has  rolled  back.  Here, 
process  p  has  failed  and  restored  state  Ai.  However,  process  q  at  B3  has  received 
a  message  whose  send  event  has  been  rolled  back.  Hence,  and  B^  depend  on 
computation  that  has  no  longer  happened.  Consequently,  B^  and  B^  should  also 
be  rolled  back. 


p: 


p: 


r. 


Figure  4.3  Transitive  dependence  further  complicates  rollback.  Process  p  has 
failed  and  rolled  back.  Process  r  depends  on  computation  at  process  q  that  in  turn 
depends  computation  that  process  p  has  rolled  back.  Thus,  the  failure  at  process 
p  makes  it  necessary  for  process  r  to  roll  back,  even  though  process  r  has  never 
received  any  message  directly  from  process  p. 
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Rollback  with  Modified  Replay  After  a  process  p  tolls  back  and  restores  some  earlier  state 
A,  it  has  a  number  of  options.  Process  p  could  re-execute  the  same  computation  beginning  at  state 
A  that  it  originally  executed.  Alternatively,  process  p  could  intentionally  execute  a  computation 
beginning  at  state  A  that  differs  at  some  point  from  its  original  execution;  this  approach  is  termed 
rollback  with  modified  replay.  For  example,  perhaps  process  p  alters  its  activity  in  order  to  avoid 
the  conditions  that  led  to  the  failure,  or  perhaps  process  p  rolled  back  explicitly  to  take  another 
course  of  action  (rather  than  to  recover  from  failure). 


Concurrent  Rollbacks  The  possibility  that  rollback  may  occur  asynchronously  raises  some 
questions: 

•  Multiple  processes  could  initiate  rollback  concurrently — ^perhaps  to  recover  from  the  same 
failure,  or  perhaps  to  recover  from  different  failures.  Can  the  recoveries  be  merged?  If  not, 
which  recovery  is  performed  first?  Do  the  others  still  need  to  be  performed? 

•  A  process  might  fail  and  initiate  rollback  before  recovery  from  some  earlier  rollback  at 
another  process  is  complete.  Can  these  two  recoveries  proceed  concurrently? 


4.1.3.  Rollback  Recovery  Protocols 

Beyond  recovering  the  system  when  one  or  more  processes  fail,  rollback  recovery  protocols  have 
several  implicit  goals: 

•  minimizing  the  computation  lost  due  to  failure  (e.g.,  the  interval  from  the  original  execution 
of  the  restored  state  to  the  failure); 


Figure  4.4  Successful  recovery  protocols  need  to  consider  delayed  messages. 
Believing  the  system  is  recovered  and  blindly  accepting  this  message  will  cause 
process  q  to  become  an  orphan. 
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Rguro  4.5  Fixing  the  problem  of  Figure  4.4  by  discarding  all  messages  sent 
before  rollback  can  lead  to  discarding  valid  messages.  Here,  process  q  should 
accept  the  delayed  message — even  though  the  message  was  sent  by  process  p 
before  recovery. 


•  minimizing  the  computation  wasted  due  to  rollback  (e.g.,  the  number  of  surviving  processes 
that  must  be  rolled  back,  the  number  of  times  they  must  roll  back,  the  delay  before  which 
they  begin  their  rollbacks,  and  the  amount  of  rolled-back  computation  that  did  not  depend 
on  computation  lost  due  to  the  failure);  and 

•  minimizing  the  overhead  of  the  recovery  protocol  during  failure-free  execution. 

Checkpointing  One  approach  to  recovery  is  based  on' checkpointing:  processes  periodically 
checkpoint  their  local  state  to  stable  storage.  Rollback  recovery  protocols  based  on  checkpointing 
(e.g.,  [BhLi88,  BCS84,  Ci89,  EJZ92,  KoTo87,  LeBh88,  LNP^])  organize  local  checkpoints  into 
system-wide  global  checkpoints,  and  recover  from  failure  by  rolling  back  all  processes  to  one 
of  these  recovery  lines.  Protocols  use  varying  degrees  of  synchronization  in  establishing  global 
checkpoints.  Using  too  little  synchronization  permits  pathological  scenarios  where  a  single  failure 
could  lead  to  the  domino  effect  [Ra75,  Ru80]  in  which  all  processes  are  forced  to  roll  back  to 
their  initial  states  regardless  of  the  amount  of  progress  made  before  the  failure.  Careful  use  of 
synchronization  avoids  the  domino  effect.  Nevertheless,  checkpointing-based  recovery  wastes 
computation  by  rolling  back  beyond  the  theoretical  minimum.  Processes  that  have  dependence  on 
the  failure  must  (in  general)  roll  back  computation  that  occurred  before  dependence  was  established; 
processes  that  have  no  dependence  may  also  need  to  discard  their  progress  and  roll  back. 

Message  Logging  and  Replay  Another  approach  to  recovery  is  based  on  message  logging 
and  replay.  Processes  log  their  received  messages  and  occasionally  checkpoint  their  local  state. 
Consequently,  processes  may  recreate  any  past  state — not  just  the  ones  saved  as  checkpoints — by 
restoring  an  earlier  checkpoint  and  replaying  the  received  messages  from  the  log.  This  approach 
offers  two  significant  advantages: 
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•  Logging  a  message  is  cheaper  than  recording  a  checkpoint. 

•  Message  logging  reduces  wasted  computation,  since  surviving  processes  only  roll  back 
computation  that  depends  on  the  computation  lost  to  failure. 


Pessimistic  rollback  protocols  (e.g.,  [BBG83,  BBGH89,  ElZw92,  JoZw87,  PoPr831)  syn¬ 
chronize  message  logging  with  the  underlying  computation.  For  example,  a  process  may  not 
proceed  beyond  the  receipt  of  a  message  until  that  message  is  successfully  logged  to  stable  storage. 
Pessimistic  protocols  simplify  recovery,  since  a  surviving  process  never  depends  on  computation 
lost  due  to  process  failure.  However,  the  logging  synchronization  needed  by  pessimistic  protocols 
leads  to  decreased  performance  [Jo89]. 

Optimistic  rollback  protocols  (e.g.,  [Jo89,  JoZw90,  Jo93,  PeKe93,  SiWe89,  StYe85,  Zw88]) 
buffer  received  messages  in  volatile  storage,  and  asynchronously  log  them  to  stable  storage.  A 
process  may  proceed  beyond  the  receipt  of  a  message  before  the  message  is  successfully  logged. 
These  protocols  optimistically  bet  that  a  process  will  not  fail  before  the  logging  of  its  received 
messages  is  complete.  However,  a  failure  at  a  process  that  has  not  finished  logging  may  create 
orphans  at  other  processes.  Consequently,  optimism  complicates  recovery,  since  protocols  must 
be  able  to  detect  and  eliminate  orphans  throughout  the  system.  However,  optimistic  protocols  are 
cheaper  during  failure-free  operation. 


4.1 .4.  Asynchronous  Optimistic  Roiiback  Recovery 

This  chapter  uses  the  framework  of  distributed  time  to  consider  optimistic  rollback  recovery. 
Optimistic  protocols  already  have  low  failure-free  overhead.  Our  tools  for  time  abstraction  allow 
us  to  improve  on  previous  work  by  simplifying  the  task  of  recovery. 

Most  optimistic  rollback  rollback  protocols  require  synchronization  in  recovery.  However,  the 
more  decentralized  a  distributed  protocol  is,  the  better  its  potential  for  exploiting  the  advantages  of 
distribution  (e.g.,  concurrency)  and  being  robust  against  the  disadvantages  (e.g.,  asynchrony  and 
unreliable  networks).  Strom  and  Yemini  (StYe85]  initiated  the  area  of  optimistic  rollback  recovery 
and  presented  the  most  asynchronous  protocol  prior  to  ours. 


Strom  and  Yemini  In  the  Strom  and  Yemini  protocol,  processes  use  timestamp  vectors  to  track 
dependency.  When  a  process  rolls  back,  it  begins  a  new  incarnation  and  sends  announcements  to 
the  other  processes.  (These  announcement  messages  are  not  part  of  the  failure-free  computation, 
and  thus  do  not  carry  dependency.)  When  a  process  receives  a  rollback  announcement,  it  uses 
its  timestamp  vector  to  determine  if  it  is  currently  an  orphan;  if  so,  this  process  rolls  back  to  its 
maximal  non-orphan  state  by  restoring  an  old  checkpoint  and  replaying  its  received  m<*.ssages  until 
a  message  is  reached  whose  send  event  is  an  orphan.  A  process  receiving  a  rollback  announcement 
also  saves  the  incarnation  start  information  from  the  announcement  for  use  in  subsequent  vector 
sorting;  delayed  announcements  may  require  non-faulty  processes  to  block. 
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Strom  and  Yemini  do  not  require  processes  to  synchronize  during  recovery.  This  asynchrony 
offers  several  advantages: 

•  processes  can  recover  without  the  delay  of  synchronization; 

•  recovery  from  concurrent  failures  can  proceed  concurrently;  and 

•  once  initiated,  recovery  can  sometimes  pioceed  despite  network  partitions. 

However,  the  Strom  and  Yemini  protocol  has  a  significant  disadvantage:  a  single  failure  at  one 
process  may  lead  to  6(2”)  rollbacks  (where  n  is  the  number  of  processes  in  the  system).  This 
behavior  occurs  because  an  orphan  state  at  a  surviving  process  r  may  depend  on  the  lost  computa¬ 
tion  through  multiple  paths:  directly  from  the  failed  process,  and  indirectly  through  intermediate 
processes.  Even  with  its  assumption  of  FIFO  message  ordering,  Strom  and  Yemini’s  protocol  may 
generate  failure  announcements  in  such  a  way  that  process  r  rolls  back  in  response  to  the  rollbacks 
of  intennediate  processes,  and  then  in  response  to  the  rollback  of  the  failed  process.  Figure  4.6 
shows  a  simple  scenario  in  which  process  r  rolls  back  twice  in  response  to  a  single  failure  at  process 
p;  Section  4.3.6  presents  an  inductive  construction  showing  an  exponential  number  of  rollbacks. 


Distributed  Time  The  framework  of  distributed  time  allows  us  to  talk  about  time  abstraction 
on  multiple  levels: 

•  We  can  use  one  level  of  partial  order  time  to  describe  potential  causality  in  the  failed  com¬ 
putation. 

•  We  can  use  another  level  of  partial  order  time  to  describe  potential  knowledge  in  the  recovery 
computation. 

Using  timestamp  vectors  for  both  levels  allows  us  to  build  an  orphan  test  exploiting  all  potential 
information.  This  ability  directly  leads  to  an  optimistic  recovery  protocol  that  provides  completely 
asynchronous  recovery  but  requires  surviving  processes  to  roll  back  at  most  once  in  response  to 
the  failure  of  any  process.  (Figure  4.7  provides  a  rough  sketch.) 

Our  new  recovery  protocol  improves  on  Strom  and  Yemini’s  work  by  reducing  the  worst  case 
from  exponential  to  constant,  and  improves  on  other  optimistic  recovery  protocols  by  requiring  no 
synchronization  during  recovery.  Our  protocol  also  provides  additional  flexibility:  messages  need 
not  be  FIFO,  and  no  extra  messages  need  to  be  transmitted.  Further,  developing  our  protocol  in 
the  framework  of  distributed  time  allows  transparent  integration  with  other  applications  based  on 
partial  order  time,  and  transparent  protection  against  clock-based  security  and  privacy  attacks.  Like 
other  optimistic  approaches,  our  protocol  does  not  require  any  process  to  roll  back  computivtion  that 
does  not  depend  on  lost  computation  at  the  failed  process.  Table  II  presents  a  table  comparing  our 
protocol  to  three  principal  optimistic  rollback  protocols,  and  to  a  sample  checkpointing  protocol. 
(Section  4.3.6  provides  a  fuller  discussion  of  these  protocols.) 
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This  Chapter  Section  4.2  di«:usses  the  relevance  of  distributed  time  to  rollback  recovery. 
Section  4.3  presents  our  new  protocol.  Finally,  Section  4.4  uses  distributed  time  to  derive  a  general 
framework  for  rollback  problems  and  recovery  protocols.  (Chapter  5  and  Chapter  6  will  explore 
the  security  issues.) 

We  presented  a  preliminary  version  of  our  new  protocol  in  an  earlier  publication  [SJT94]. 


4.1.5.  Assumptions 

Recoverability  Section  4.1.1  and  the  remainder  of  this  chapter  implicitly  assume  complete 
recoverability:  each  state  at  every  non-faulty  process  can  be  recovered. 


P- 
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Figure  4.6  The  Strom  and  Yemini  protocol  may  cause  surviving  processes  to  roll 
back  multiple  times  in  response  to  a  single  failure.  This  diagram  shows  how  one 
failure  at  process  p  causes  process  r  to  roll  back  twice.  Process  p  fails  and  rolls 
back  Az  through  A^.  This  failure  makes  process  q  an  orphan  (since  q  depends  on 
the  lost  computation  via  path  Pi)  and  also  makes  process  r  an  orphan  (directly 
through  path  Pz,  and  indirectly  through  paths  Pi  and  P3).  When  process  q  receives 
p’s  announcement  about  Az,  q  rolls  back  to  its  most  recent  state  that  does  not 
depend  on  Az.  Unfortunately,  ^’s  announcement  may  arrive  at  process  r  before 
p's  announcement  does.  When  process  r  receives  q's  announcement  about  D^,  r 
rolls  back  to  its  most  recent  state  that  does  not  depend  on  £3.  Process  r  does  not 
know  that  its  restored  state  is  still  an  orphan  until  after  the  delayed  p  announcement 
arrives  (at  C%). 
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Figure  4.7  Using  two  levels  of  partial  order  time  allows  asynchronous  recovery 
while  avoiding  the  problem  of  multiple  rollbacks.  This  diagram  roughly  sketches 
the  principles  involved  in  our  new  protocol.  Solid  arrows  indicate  both  poten¬ 
tial  causality  in  the  failed  computation  and  potential  knowledge  in  the  recovery 
computation.  Dashed  arrows  indicate  only  potential  knowledge  in  tfie  recovery 
computation.  As  in  Figure  4.6,  process  p  fails  and  rolls  back  A2  through  A5.  Thus 
A6  logically  succeeds  A\  rather  than  As',  hence  the  dashed  edge  from  As  to  .46 
and  ihe  solid  edge  from  Ai  to  A(,.  Process  p  sends  an  announcement  to  process 
q]  since  this  announcement  is  not  part  of  the  underlying  computation,  we  use  a 
dashed  edge.  Process  q  rolls  back,  and  sends  an  announcement  that  process  r 
receives  at  C^.  Via  dependence  path  P,,  C2  depends  on  lost  A2  and  is  an  orphan. 
However,  via  knowledge  path  P2,  process  r  at  Ce  is  potentially  aware  that  .42  has 
been  lost.  Comparing  timestamp  vectors  across  partial  orders  allows  process  r  at 
Cf,  to  determine  that  C2  is  an  orphan.  Thus,  unlike  Figure  4.6,  process  r  rolls  back 
far  enough  the  first  tinie. 
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Table  II  The  distributed  time  protocol  for  rollback  recovery  compares  favorably  to 
previous  work  in  many  aspects,  its  principal  drawback  is  timestamp  size,  since  the 
protocol  requires  vector  clocks  for  two  levels  of  partial  order  t^me. 


As  we  have  discussed,  optimistic  recovery  protocols  typically  provide  complete  recoverability 
of  states  at  non-faulty  processes  by  asynchronously  taking  local  checkpoints  at  each  process,  and 
by  asynchronously  logging  the  messages  each  process  receives.  To  restore  a  state  A,  a  process  p 
rolls  back  to  its  most  recent  checkpointed  state  preceding  or  equaling  A,  and  then  re-executes  its 
computation  (replaying  received  messages  from  its  log)  until  it  reaches  state  A.  This  approach 
requires  that  process  execution  be  piecewise  deterministic  (that  is,  deterministic  between  message 
receive  events).  When  a  process  fails,  it  may  lose  recent  logging  information,  since  logging 
proceeds  asynchronously  with  the  underlying  computation.  (This  fact  distinguishes  optimistic 
recovery  from  pessimistic  recovery.) 

This  logging  and  replay  approach  can  be  extended  to  nondeterministic  execution  by  having 
processes  treat  nondeterministic  influences  as  incoming  messages  [ElZw92,  Jo93].  For  example, 
if  a  process  state  enables  a  transition  to  multiple  states,  the  process  might  asynchronously  log  the 
index  of  the  choice  that  is  made.  Our  automata  model  of  Section  2.2. 1  permits  each  state  transition 
at  a  process  to  be  non-deterministic.  The  extended  logging-and-replay  approach  would  provide 
complete  recoverability  for  our  model,  and  for  the  protocols  presented  in  this  chapter.  With  some 
modiflcations,  the  simpler  piecewise  deterministic  model  (with  its  simpler  logging  scheme)  also 
suffices.  Section  4.3.5  discusses  these  modiflcations. 


Commitability  The  examples  in  Section  4.1.1  also  implicitly  assume  that  any  state  or  event  can 
be  rolled  back.  Achieving  this  assumption  in  practice  is  difficult.  Rolling  back  arbitrary  nodes 
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may  be  impossible:  for  example,  interaction  with  the  outside  world  may  lead  to  activity  (e.g., 
launching  a  missile)  that  cannot  easily  be  undone.  Providing  the  ability  to  roll  back  arbitrary  nodes 
may  be  expensive,  since  processes  can  then  never  throw  out  any  checkpoints  or  log  data.  Rollback 
recovery  for  fault  tolerance  requires  only  keeping  sufficient  data  to  restore  the  maximal  recoverable 
system  state;  however,  the  question  still  arises  of  determining  which  data  this  is. 

Due  to  these  problems,  realistic  protocols  also  need  to  consider  stability  and  commitability. 
A  state  or  event  is  stable  when  it  has  been  successfully  logged  to  stable  storage;  a  state  or  event 
is  commitable  when  it  will  never  be  rolled  back  [JoZw90].  If  rollback  only  occurs  to  recover 
from  process  failure,  then  a  node  is  commitable  when  every  node  in  its  timestamp  vector  is  stable. 
When  a  stable  node  A  becomes  commitable  at  a  process,  the  process  will  never  need  to  recreate  an 
earlier  node,  and  thus  may  discard  all  earlier  log  data  (except  that  which  is  necessary  to  recreate 
A).  Furthermore,  activity  with  potentially  permanent  side-effects  may  proceed  safely.  For  space 
efficiency,  recovery  protocols  should  allow  processes  to  discard  unnecessary  data.  For  example, 
in  the  Strom  and  Yemini  protocol,  each  process  maintains  a  vector  indicating  the  logging  status  of 
the  other  processes,  and  uses  this  vector  to  determine  when  node  is  commitable  (and  thus  when 
previous  log  data  may  be  discarded). 

For  clarity  of  presentation,  our  protocols  do  not  address  the  issue  of  commitability.  However, 
a  vector  solution  similar  to  Strom  and  Yemini’s  easily  incorporates  into  our  framework. 


Failure  Detection  and  Reconfiguration  We  also  do  not  consider  mechanisms  for  processes 
to  detect  failure,  nor  for  selecting  the  physical  site  where  a  failed  a  process  should  restart.  (However, 
since  our  framework  provides  tools  for  hierarchies  of  abstraction,  it  may  simplify  many  issues  in 
process/processor  mapping.) 


4.2.  Rollback  and  Distributed  Time 


This  section  applies  the  framework  of  distributed  time  to  the  rollback  problem.  Section  4.2. 1 
discusses  the  relevance  of  distributed  time.  Section  4.2.2  introduces  the  idea  that  processes  per¬ 
forming  rollback  have  two  levels  of  consciousness — the  system  level  of  a  process  implements  the 
user  level.  Section  4.2.3  builds  a  time  model  for  the  computation  performed  by  the  system  level 
of  processes;  Section  4.2.4  builds  a  time  model  for  the  computation  performed  by  the  user  level. 
Section  4.2.5  introduces  some  notation  for  mapping  between  the  system  level  and  the  user  level. 
Section  4.2.6  discusses  the  mechanics  of  retroactive  change — how  rollback  protocols  might  alter 
the  computation  in  progress.  Section  4.2.7  discusses  how  the  failure-free  virtual  computation  arises 
from  the  user-level  computation. 
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4.2.1 .  The  Relevance  of  Distributed  Time 


Optimistic  rollback  recovery  changes  the  history  that  user  processes  perceive.  Distributed  time 
provides  abstraction  tools  that  apply  on  several  levels: 

•  Dependence  on  Failure  Optimistic  rollback  recovery  permits  orphans  to  exist  at 
processes  other  than  the  process  that  failed.  If  we  have  accurate  timestamps  from  real  time 
and  no  prior  failures  have  occurred,  then  we  might  try  using  real  time  to  test  for  orphans: 
anything  that  occurred  after  the  failure.  This  test  will  detect  all  orphans,  but  has  some 
substantial  flaws.  First,  using  real  time  is  wasteful — many  states  that  did  not  depend  on  the 
failure  will  be  rolled  back  unnecessarily.  Further,  as  Section  2.5.1  observed,  this  approach 
easily  breaks  down  in  realistic  scenarios: 

-  Not  all  processes  may  receive  word  of  the  failure  simultaneously.  Real  time  does  not 
distinguish  between  states  that  depend  on  the  failed  state,  and  states  that  occur  after 
recovery  from  dependence  on  the  failed  state. 

-  Suppose  a  failure  occurs  at  process  q  before  recovery  from  a  failure  at  process  p  is 
complete.  Real  time  is  not  suHiciently  articulate  to  express  the  resulting  nuances.  For 
example,  paerhaps  the  current  node  at  process  r  was  an  orphan  due  to  both  failures,  but 
process  r  rolled  back  in  response  to  process  p’s  failure.  Is  the  current  node  at  r  still  an 
orphan  due  to  ^’s  failure? 

-  Some  state  restoration  mechanisms  require  a  process  to  re-execute  events.  Such  a 
re-executed  event  may  have  a  later  timestamp  than  events  that  it  influenced. 

The  nuances  of  dependence  are  better  captured  by  a  partial  order  time  model  (possibly,  as 
the  last  example  illustrates,  a  flow-virtual  model  that  does  not  follow  directly  from  the  real 
timelines  at  processes). 

•  The  Failure-Free  Virtual  Computation  Recovery  from  failure  changes  the  underlying 
computation.  The  failure-free  virtual  computation  that  appears  to  have  happened  after  re¬ 
covery  is  complete  is  also  expressed  by  a  partial  order — ^but  this  partial  order  differs  from  the 
one  tracking  dependency  in  the  failed  computation.  For  example,  suppose  we  wanted  to  use 
the  results  of  Chapter  3  to  take  a  snapshot  from  the  logical  past  of  the  virtual  computation, 
or  suppose  we  need  to  recover  from  a  second  failure.  Applying  protocols  based  on  partial 
order  time  to  the  failure-free  virtual  computation  requires  access  to  this  second  partial  order. 

•  The  Recovery  Computation  The  recovery  computation  is  itself  a  distributed  com¬ 
putation,  different  from  both  the  original  failed  computation  and  the  failure-free  virtual 
computation.  The  recovery  computation  is  expressed  by  a  third  partial  order — one  that 
would  be  constructed  by  an  external  observer  who  did  not  know  that  the  system  was  per¬ 
forming  a  recovery  algorithm.  Reasoning  about  the  progress  of  recovery  (e.g.,  “who  knows 
about  what  rollbacks?”)  and  integrating  the  recovery  computation  with  other  applications 
(e.g.,  snapshots)  requires  using  the  recovery  partial  order. 
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A  Single  Framework  Distributed  time  provides  the  tools  to  represent  computation  at  all  the 
levels  of  abstraction  that  arise  when  considering  rollback  recovery.  The  remainder  of  Section  4.2 
will  develop  these  levels  of  abstraction. 


4.2.2.  Bipartite  Processes 

The  most  straightforward  representation  of  the  state  at  a  process  is  as  a  set  of  bits.  However,  our 
distributed  time  theory  introduces  the  abstraction  that  process  clocks  track  temporal  relations.  A 
firewall  limits  the  interaction  between  a  process  and  its  clock  to  formal  queries  and  responses. 
Figure  4.8  illustrates  this  view. 

Both  the  decision  to  roll  back  and  the  inability  to  directly  control  the  state  of  the  network 
highlight  the  need  for  managing  rollback  at  a  process.  This  need  introduces  a  second  firewall  inside 
a  process:  in  order  for  a  simple  process  of  the  form  of  Figure  4.8  to  exist  in  the  virtual  computation, 
it  must  have  with  it  another  process  that  handles  the  management.  Figure  4.9  illustrates  this  revised 
view:  an  implementing  process  supports  the  implemented  process. 

The  implemented  process — the  state  and  action  of  the  process  above  the  firewall  of  Figure  4.9 — 
is  the  user  level  of  that  process.  The  state  and  action  of  the  entire  process  is  the  system  level.  The 
management  state  is  the  portion  of  process  state  exclusively  part  of  the  system  level. 

Defining  rollback  requires  the  use  of  the  Figure  4.8  view  of  a  process.  Implementing  rollback 
requires  the  use  of  the  Figure  4.9  view.  Multiple  levels  of  abstraction  at  a  process  mesh  nicely  with 
multiple  levels  of  abstraction  and  time  (as  the  following  sections  discuss). 


4.2.3.  The  System  Computation 


An  external  observer  who  did  not  know  that  failure  and  recovery  was  occurring  would  be  oblivious 
to  the  process  structure  of  Figure  4.9.  This  observer’s  point  of  view  would  provide  no  distinction 
between  the  implemented  process  and  the  implementing  process. 


Ctoek 


Figure  4.8  Encapsulating  time  services  into  a  clock  module  revises  our  view  of 
process:  we  now  now  can  think  of  the  internal  state  of  the  process  as  separate 
from  the  state  of  the  process  clock. 
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Clock 

Figure  4.9  Managing  the  virtual  existence  required  by  rollback  introduces  another 
firewall:  between  the  implemented  process  and  the  implementing  process. 


We  define 

SYSTEM  -  PARTIAL  .  ORDER 

to  be  the  time  model  obtained  by  applying  partial  .ORDER  .TIME  without  distinguishing  process 
levels.  (That  is,  we  ignore  the  firewall  between  the  implemented  process  and  the  implementing 
process.)  Similarly,  we  define 

SYSTEM  .TIMELINES 

to  be  the  model  obtained  by  applying  TIMELINES  without  distinguishing  process  levels. 

We  use  the  notation  Vsys(A)  to  indicate  the  timestamp  vector  of  a  node  A  a  graph  from 
SYSTEM  -  PARTIAL  .  ORDER. 

The  pair  of  models  (system .partial. ORDER,  system. timelines)  forms  a  Type  3  parallel 
pair — consistent,  independent,  and  strongly  monotonic.  The  possibility  that  process  failure  may 
disrupt  information  flow  prevents  the  pair  from  being  flow-supported,  and  thus  from  being  Type  4, 
Clocks  for  SYSTEM.  PARTIAL  .ORDER  must  be  designed  around  this  limitation.  Section  4.3.4  con¬ 
siders  these  issues. 


4.2.4.  The  User  Computation 

In  this  section,  we  build  a  USER  .partial  .ORDER  model  to  express  the  user-level  computation 
performed  by  the  user  levels  of  processes.  This  construction  is  a  bit  more  complicated  than  the 
construction  in  Section  4.2.3:  we  obtain  the  user  .partial  .order  model  as  the  composition  of 
an  IMPLEMENT  model  with  the  SYSTEM  .partial  .ORDER  model. 


Implementation  The  system  level  of  process  computation  implements  the  user  level.  Building 
the  user  model  requires  defining  this  implementation. 

In  terms  of  ou'  time  models,  a  user  process  does  four  things: 
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•  It  holds  a 


•  It  performs  internal  computation. 

•  It  sends  a  message. 

•  It  receives  a  message. 


The  system-level  computation  of  a  process  implements  these  four  things: 


•  A  user  srarc  node  consists  of  a  maximal  sequence  of  system-level  state  nodes  and  system-level 
transition  nodes  that  do  not  change  the  user  state. 

•  A  user  internal  computation  transition  occurs  when  the  user  state  changes,  due  to  an  imple¬ 
mented  internal  transition. 

•  A  user  send  event  consists  of  a  change  in  user  state  along  with  the  transfer  of  the  message 
to  the  management  state  (the  virtual  send);  the  system  state  subsequently  sends  the  message 
out  as  part  of  a  system  message.  (The  potential  exists  here  for  the  system  process  to  suppress 
the  message.) 

•  A  user  receive  event  occurs  when  the  system  process  receives  a  user  message  and  decides  to 
pass  it  on  to  the  user-level  process.  In  the  virtual  receive,  the  management  state  changes  (to 
reflect  the  dequeuing  of  the  message)  and  the  user  state  changes  (to  reflect  the  receive). 


The  system-level  computation  at  a  process  can  also  perform  rollback:  a  discontinuous  change 
in  the  user  state. 


Nodes  The  preceding  discussion  of  how  the  system-level  implements  the  user-level  directly  tells 
what  nodes  the  IMPLEMENT  model  should  produce  when  it  is  applied  to  a  system  .partial  .order 
graph,  and  what  these  nodes  should  represent.  This  mapping  is  described  in  the  remainder  of  this 
section,  and  is  illustrated  in  Figure  4. 10  through  Figure  4.14. 


Timelines  as  Trees  With  one  exception,  the  logical  ordering  of  user  nodes  follows  from  the 
semantics  of  implementation — thus  the  implement  model  draws  directed  edges  between  consecu¬ 
tive  user  nodes,  and  draws  a  directed  edge  to  each  user  receive  event  from  the  corresponding  user 
send  event. 

The  exception  is  the  rollback  transition.  Rollback  requires  a  user  process  to  restore  an  earlier 
state  and  continue  execution  from  there.  Logically,  the  restored  state  node  becomes  a  sibling^  of 
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Section  4.3.5  considers  the  implications  of  restoring  the  original  node  itself. 


its  original  instance;  unless  rollback  occurs  again,  subsequent  user  nodes  form  a  linear  sequence 
extending  from  this  restored  state  node. 

This  branching  constitutes  a  departure  from  the  linear  timeline  basis  of  parallel  pairs.  The  live 
history  of  a  user  node  consists  of  its  past  in  the  transitive  closure  in  its  process  timetree. 

The  User  Model  The  implement  model  acts  on  system  .partial  .order  graphs  to  abstract 
away  the  implementation  details  of  the  user  computation.  Figure  4.10  sketches  the  production  of 
state  nodes;  Figure  4.11  sketches  internal  transitions;  Figures  4.12  and  4.13  sketch  the  send  and 
receive  events  for  user  messages;  and  Figure  4.14  sketches  the  rollback  transition.  We  define  the 
USER -PARTIAL -ORDER  model  as  a  composition: 

USER -PARTIAL -ORDER  =  IMPLEMENT  o  SYSTEM -PARTIAL -ORDER 

We  define  the  TIMETREES  model  as  user  .partial -ORDER,  less  the  message  edges. 

The  models  (USER -PARTIAL -ORDER, TIMETREES)  form  a  Type  4  (consistent,  independent, 
strongly  monotonic  and  flow-supported)  nonlinear  pair:  a  parallel  pair,  less  the  requirement  that 
process  graphs  be  linear.^  This  will  be  the  only  nonlinear  pair  considered  in  this  thesis. 


Timestamp  Pseudo*vectors  Because  nonlinear  pairs  do  not  place  the  nodes  at  a  process  in 
a  linear  order,  we  cannot  guarantee  that  any  collection  of  nodes  at  a  process  has  a  minimal  and 
a  maximal  element.  This  uniqueness  property  is  key  to  defining  timestamp  vectors  and  rollback 
vectors  for  nodes — without  it,  the  definitions  collapse. 


state 

7u 


state  internal  state 


Figure  4.10  Under  the  implement  map,  each  user -partial  .order  state  node 
(top)  represents  a  maximal  sequence  of  system,  partial  .order  nodes  that  have 
no  user  state  changes  (bottom).  Subscripts  on  the  state  labels  distinguish  the  user 
part  of  process  state  from  the  management  part. 


‘'Section  2.2.8  discussed  nonlinear  pairs. 


78 


internal 


state 

7u 


state 

®u 
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Rgure  4.1 1  Under  the  implement  map,  each  user  _  partial  _  order  internal  node 
(top)  represents  a  system.pariial- order  internal  node  that  implements  a  user 
state  change  (bottom). 


Figure  4.12  To  implement  a  user  .partial  .order  send  (top),  the  system  process 
lets  the  user  process  send  the  message  virtually.  The  system  process  then  takes 
care  of  the  details  of  actually  sending  the  message  (bottom). 


Rgure  4.13  To  implement  a  user  .partial. order  receive  (top),  the  system 
process  receives  the  message  and  forwards  it  to  the  user  process  (bottom). 
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Rgure  4.1 4  A  system  process  performs  a  rollback  transition  by  restoring  an  earlier 
user  state  (bottom).  The  implemented  user  transition  falls  outside  the  normal  rules 
of  transition  of  user  state;  thus  the  new  user  state  is  a  logical  sibling  of  its  earlier 
instance.  We  adopt  convention  that  the  restored  user  state  node  represents  the 
rollback  transition.  A  large  “X”  marks  each  node  that  has  been  rolled  back. 


For  a  given  node  A  in  USER  .partial  .ORDER,  we  can  still  define  a  timestamp  pseudo-vector 
V'(A)  as  the  riKUsTREES-maxima  of  the  live  history  of  A.  The  timestamp  pseudo-vector  will  not 
necessarily  be  a  true  vector,  since  it  may  contain  multiple  nodes  from  the  same  process  (but  from 
different  branches  of  the  timetree).  If  a  timestamp  pseudo- vector  Vi  A)  is  in  fact  a  vector,  we  use 
the  notation  \ustiA). 

(Section  4.2.7  will  discuss  further  properties  of  timestamp  pseudo-vectors,  and  will  observe 
why  generalizing  rollback  vectors  to  rollback  pseudo-vectors  is  difficult.) 


4.2.5.  Mapping  Between  the  System  and  User  Computation 

We  will  need  to  map  between  the  user,  partial,  order  and  the  system,  partial  .order  levels 
of  abstraction.  This  section  introduces  some  tools  for  this  mapping.  We  show  that  user  precedence 
implies  system  precedence  (of  corresponding  nodes);  we  introduce  some  shortcuts  for  graphical 
notation;  and  we  provide  some  clock  primitives  for  processes  to  perform  this  mapping  explicitly. 


Precedence  User  precedence  implies  system  precedence. 

Theorem  4.1  Let  ^  he  a.  system  .partial,  order  graph,  and  let  7  be  the  corre¬ 
sponding  USER -PARTIAL -ORDER  graph.  Let  A(/,  Bu  be  nodes  in  7.  Any  choice  of  As 
from  ( IMPLEMENT,  /3  )(/4)  and  Bs  from  ( IMPLEMENT,  I3){B)  satisfy  the  statement: 

Au  — >  Bu  in  7  =»  As  — ►  Bs  in  0 

Proof  This  follows  from  the  definition  of  IMPLEMENT.  □ 
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Graphical  Shorthand  Theorem  4. 1  implies  that,  when  we  care  about  transitive  precedence 
and  the  node  set  is  unambiguous,  we  may  use  the  same  drawing  to  represent  relations  from  both 
the  SYSTEM.  PARTIAL -ORDER  model  and  the  USER  .partial  .order  model.  Figure  4.15  shows  an 
example.  For  these  drawings,  we  adopt  the  convention  of  dashed  arrows  for  system-only  edges, 
and  solid  arrows  for  edges  that  carry  precedence  in  both  the  system  and  user  models.  System 
precedence  corresponds  to  any  path;  user  precedence  corresponds  to  paths  composed  of  solid 
edges  only. 

Usually,  we  can  build  these  combined  diagrams  showing  USER  _  partial  _  order  nodes.  Potential 
for  ambiguity  arises  when  we  want  to  consider  the  system  activity  that  a  user  node  represents.  We 
are  particularly  interested  in  three  areas: 


•  the  send  event  for  a  system  message  (and  the  decision  to  send  it); 

•  the  receive  event  for  a  system  message  (and  subsequent  processing);  and 

•  the  receive  event  for  a  user  message  that  a  system  process  decides  to  discard  (so  the  message 
becomes  a  system-only  message). 


When  relevant,  we  indicate  the  interesting  sequence  of  SYSTEM  .partial .ORDER  nodes  that  a 
USER. partial. ORDER  node  represents  by  a  “peapod”  drawing  (such  as  Figure  4.16).  As  system- 
only  edges,  SYSTEM -PARTIAL -ORDER  messages  are  indicated  by  dashed  arrows.  This  convention 
fails  to  distinguish  a  system-only  message  from  a  rejected  user  message;  where  relevant,  this 
distinction  will  be  made  clear  in  discussion. 


Primitives  for  Mapping  Some  of  our  protocols  require  processes  themselves  to  map  between 
levels.  We  now  specify  some  explicit  primitives  for  this  task.  (Recall  from  Section  2.4  that 
CUR -GRAPH  returns  the  current  ground- level  computation  graph.) 

Processes  need  to  map  nodes  back  and  forth.  We  define  two  primitives: 


state 

2u 


state  state 
7u  2u 


Figure  4.15  Since  Theorem  4,1  shows  that  user  precedence  implies  sys¬ 
tem  precedence,  the  same  drawing  can  show  both  user  _  partial  .  order  and 
SYSTEM  -  partial  _  ORDER  information.  We  use  the  convention  that  dashed  edges 
carry  precedence  in  system,  partial  .order  only.  This  sketch  shows  that,  after 
rollback,  the  restored  state  system-follows  the  aborted  state. 
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state  state  state 

2u  2u 


Figure  4.16  When  relevant,  a  “peapod”  drawing  reveals  the  implementation  de¬ 
tail  of  a  user  node.  Suppose  upon  rollback,  the  system  process  here  sends  a 
system-only  message.  We  can  indicate  that  in  our  combined  drawing  by  expand¬ 
ing  the  node  for  the  restored  state  into  a  "peapod”  indicating  the  subsequence 
of  interesting  system  nodes — rollback,  state,  send,  and  state — ^that  this  user  node 
represents.  The  dashed  message  arrow  indicates  the  system-only  message. 


•  USER{A)  returns  the  user  node  representing  system  node  A. 

USER(A)  =  JVODE(B,  A  G  (implement, /3)(B),IMPLEMENT(/^)) 

(Here,j3  =  system. partial  ^order(C(/R.GRAPB)). 

•  SYSTEA/(A)  returns  the  set  of  system  nodes  that  user  node  A  represents. 

SYSTEM(A)  =  ( IMPLEMENT,  SYSTEM, PARTIAL -ORDER(Cty/?- C/M PW)  ){A) 

Processes  need  to  work  with  vectors  on  both  levels.  We  define  a  primitive: 

•  USER.  VECTOR(  V)  takes  system  vector  V  and  maps  each  entry  A  to  its  user  version  USER{  A ) . 
Finally,  processes  need  to  work  with  messages  on  both  levels.  We  define  three  primitives: 

•  USER. MESSAGE.  TEST{M)  returns  true  iff  system  message  M  is  also  a  user  message. 

•  USER.MESSAGE(M)  extracts  the  user  message  from  system  message  M. 

•  SEND .EVENT(M ,  M)  returns  the  send  event  in  M  associated  with  message  M. 

These  primitives  form  part  of  our  clock  suite  for  processes.  However,  we  also  use  USER, 
SYSTEM,  and  USER.  VECTOR  as  informal  shorthand  for  the  operations  they  carry  out. 


4.2.6.  Retroactive  Change 

Section  4.2.3  and  Section  4.2.4  introduced  two  time  models  for  optimistic  rollback  recovery. 
Understanding  how  recovery  leads  to  a  failure-free  virtual  computation  arising  from  these  models 
is  critical  to  defining  and  solving  the  rollback  problem.  This  section  explores  these  issues. 
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Failure-Free  Computations  The  goal  of  rollback  recovery  is  to  establish  a  failure-free  virtual 
computation.  To  facilitate  this  work,  we  define  a  failure-free  trace  to  be  a  trace  of  a  system 
consisting  of  processes  with  implemented  state  only,  that  never  fail.  We  define 

FAILURE  _  FREE  _  PARTIAL  -  ORDER 

to  be  the  partial  .order  .time  model  applied  to  failure-free  traces. 


Extracting  Failure-Free  Computations  The  system  ..  partial  .  order  model  constructs  the 
standard  partial  order  for  the  system-level  computation,  and  the  USER. partial. ORDER  model 
constructs  the  dependency  partial  order  for  the  user  nodes.  In  the  terminology  of  distributed  time, 
we  say  that  the  system. partial. ORDER  model  refines  to  the  user. partial. order  model. 

SYSTEM.  PARTIAL -ORDER  t>  USER -PARTIAL -ORDER 

The  system-level  computation  determines  the  user-level  computation. 

However,  if  failure  has  actually  occurred,  then  the  user  .F.MmAL.  order  graph  will  not  be 
a  FAILURE -FREE -PARTIAL.  ORDER  graph,  because  USER -PARTIAL -ORDER  is  constructed  from  the 
TIMETREES  process  structures,  showing  all  rolled-back  computation.  Furthermore,  extracting  a 
FAILURE-FREE.  PARTIAL.  ORDER  graph  from  USER -PARTIAL -ORDER  leads  to  tricky  situations: 

•  If  recovery  proceeds  correctly  but  more  than  one  process  must  roll  back,  then  the  graph  from 
the  USER.  PARTIAL -ORDER  model  will  generate  a  unique  recovered  failure-free  computation, 
but  may  generate  multiple  “older”  computations. 

•  Consequently,  designing  correct  recovery  protocols  (or  even  unambiguously  specifying  the 
rollback  problem)  can  be  difficult.  A  particular  challenge  is  getting  a  distributed  collection  of 
processes  to  agree,  based  on  knowledge  of  one  failure  and  differing  views  on  the  unfolding 
USER -PARTIAL -ORDER  computation,  on  which  failure-free  computation  to  restore. 


An  Example  Consider  the  system  .partial  .order  computation  described  by  Figure  4.17. 
Process  q  decides  to  roll  back  the  send  event  and  establishes  a  copy  of  earlier  state  node 
82-  Process  q  sends  a  message  to  process  p,  who  cooperates.  Process  q  performs  modified  replay, 
and  executes  B5  instead. 

This  example  provides  a  clear  distinction  between  the  old  virtual  user  computation  and  the 
new  virtual  user  computation.  Figure  4.18  shows  the  failure,  free  .partial  .order  computa¬ 
tion  before  recovery;  Figure  4.19  shows  the  FAILURE  .free  .partial  .order  computation  after 
recovery. 

Identifying  the  recovery  period  in  the  USER  .PARTIAL.  ORDER  computation  of  Figure  4.17  is 
also  straightforward.  From  a  real-time  perspective,  recovery  should  begin  at  the  real  time  that 
process  q  first  rolls  back,  and  end  at  the  real  time  t  j  that  process  p  rolls  back.  Distributed  time  lets 
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Figure  4.17  This  combined  drawing  shows  an  example  of  rollback  with  modified 
replay.  After  user  state  process  q  decides  to  roll  back  node  Process  q 
restores  a  copy  of  state  B2,  and  informs  process  p,  who  cooperates  and  then  pro¬ 
ceeds  with  its  own  modified  replay.  The  pair  of  timeline  edges  marked  Y  delineates 
the  transition  from  the  old  computation  to  the  new  computation. 


Af  A2  Aj 


Figure  4.18  This  graph  shows  the  failure-free  virtual  computation 
from  Figure  4.17  before  process  q  initiates  recovery. 


p- 


^1  A'2  ^5  ^ 


9- 


Figure  4.19  This  graph  shows  the  failure-free  virtual  computation 
from  Figure  4.17  after  rollback  recovery  is  complete. 
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US  draw  even  tighter  boundaries:  the  pair  of  timeline  edges  Y  in  Figure  4. 1 7  marks  the  transition 
from  the  old  computation  to  the  new  computation. 

In  this  example,  both  process  p  and  process  q  perform  rollback.  The  user  computation  at  each  is 
a  tree;  Figure  4.20  shows  the  USER -PARTIAL -ORDER  graph.  Before  process  q  initiates  rollback  re¬ 
covery,  the  user  computation  consists  of  /I4,  B4  and  their  pasts  (the  failure  -  free  _  partial  _  order 
graph  of  Figure  4.18).  After  recovery,  the  user  computation  consists  of  to  Bt  and  their  pasts 
(the  FAILURE  -  FREE  -  PARTIAL  -  ORDER  gri^h  of  Figure  4. 1 9).  However,  the  user  -  partial  _  order 
graph  of  Figure  4.20  also  admits  a  third  failure -FREE -PARTIAL -ORDER  graph:  that  determined 
by  Ae,  Ba  and  their  pasts.  Figure  4.21  shows  this  computation,  where  message  M  is  sent  but  never 
received. 

Scenarios  exists  where  the  computation  of  Figure  4.2 1  may  be  the  correct  virtual  user  computa¬ 
tion  arising  from  the  USER -PARTIAL -ORDER  graph  of  Figure  4.20.  Suppose  process  q  decides  that 
its  initial  decision  to  roll  back  node  B2  was  incorrect,  and  wants  to  restore  its  earlier  computation. 
What  computation  should  the  system  establish?  The  most  straightforward  answer  is  to  return  from 
the  recovered  computation  of  Figure  4.19  to  the  older  computation  of  Figure  4.18.  However,  the 
computation  of  Figure  4.2 1  might  be  a  more  reasonable  result:  fewer  nodes  need  to  be  rolled  back, 
and  no  extra  messages  need  to  be  transmitted. 


Questions  Considering  this  example  raises  a  number  of  questions: 

•  How  do  FAILURE  -  FREE  -  PARTIAL  -  ORDER  Computations  arise  from  a  USER  -  partial  -  ORDER 
computation?  When  are  failure -FREE -Partial -ORDER  computations  incompatible?  The 


A'2  A5  Ag 


Figure  4.20  The  system -partial -order  computation  of  Figure  4.17  maps  to  this 
USER -PARTIAL -ORDER  graph.  The  recovery  changed  the  current  frontier  from  Aa.  Ba 

to  Af,,  B(,. 
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Figure  4.21  The  user  .partial  .order  computation  of  Figure  4.20  admits  a  third 
failure-free  virtual  computation:  one  where  message  M  is  sent  but  never  received. 


three  virtual  computations  arising  from  Figure  4.20  each  have  subgraphs  that  are  valid 
FAILURE. FREE. PARTIAL. ORDER  graphs.  Intuitively,  we  reduce  these  myriad  graphs  to  three 
distinct  computations.  Why  these  three?  Why  are  they  distinct? 

•  How  should  we  specify  rollback  problems?  When  a  process  changes  the  computation  in 
progress  by  moving  to  another  branch  in  its  USER. partial. ORDER  tree,  what  overall  change 
in  the  virtual  user  computation  should  result? 

•  How  do  we  design  recovery  protocols?  How  should  separate  processes  agree  on  the  current 
virtual  user  computation? 

•  How  should  we  evaluate  the  performance  of  rollback  protocols?  Our  discussion  suggests 
several  somewhat  independent  parameters: 

-  how  quickly  recovery  completes; 

-  how  many  processes  are  involved  with  recc  ery; 

-  how  many  nodes  need  to  be  rolled  back; 

-  how  many  messages  become  “lost;”  and 

-  how  many  system  messages  need  to  be  sent  as  part  of  recovery. 

What  tradeoffs  exist?  Which  parameters  are  most  important? 


In  Section  4.2.7,  we  begin  answering  these  questions. 


4.2.7.  Validity  and  Consistency 

In  this  section,  we  define  valid  user  nodes,  and  we  show  how  valid  nodes  comprise  consistent 
virtual  computations. 
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Validity  Identifying  a  system-wide  virtual  user  computation  begins  by  selecting  user  nodes  at 
individual  processes.  A  user  node  is  valid  when  (from  its  perspective)  it  is  part  of  a  failure-free 
virtual  computation.  That  is,  user  node  A  is  valid  when  its  live  history  forms  a  past-closed  prefix 
of  a  git^h  generated  by  FAE.URE- FREE. partial. order. 

Theorem  4.2  A  user  node  A  is  valid  iff  its  timestamp  pseudo- vector  V'(A)  is  a 
vector. 


Proof  The  pseudo-vector  V'  is  a  vector  exactly  when  the  past-closure  of  the  live  history  of  A 
touches  at  most  one  branch  in  the  timetree  at  each  process.  □ 


Consistency  A  set  5  of  nodes  in  a  user. partial. order  graph  is  consistent  iff  the  nodes 
could  all  have  been  part  of  the  same  failure-free  virtual  computation.  That  is,  the  graph  formed 
by  taking  the  nodes  in  S  along  with  their  live  histories  forms  a  past-closed  prefix  of  a  graph  from 
FAILURE  .  FREE  .  PARTIAL  .  ORDER. 

Timestamp  pseudo-vectors  provide  a  nice  way  to  describe  consistency. 

Theorem  4.3  A  set  S  of  nodes  in  a  user. partial. order  graph  is  consistent  iff 
each  node  in  S  is  valid,  and  the  HMETREES-maximum  of  the  timestamp  vectors  V(.4) 

(for  all  A  €  5)  is  a  vector. 

Proof  Let  ^  be  the  USER  .partial  .order  graph,  and  /?'  be  the  subgraph  obtained  by  taking  S 
and  their  past-closure.  If  S  is  consistent,  then  0'  is  a  valid  PARTIAL,  order  .time  graph,  so  each 
node  A  £  S  must  be  valid,  and  for  each  p,  the  p  entries  of  the  timestamp  vectors  are  orderable 
within  a  single  branch  of  the  p  timetree.  If  each  A  £  S  is  valid  and  their  timestamp  vectors  join 
to  a  vector,  then  the  past-closure  of  S  touches  exactly  one  branch  in  each  timetree,  and  so  5  is 
consistent.  □ 

(That  is,  the  timestamp  vectors  of  consistent  nodes  form  a  lattice.) 

Consistency  directly  generalizes  from  validity:  the  singleton  set  {A}  is  consistent  iff  the  node 
A  is  valid.  Consistency  of  sets  also  builds  in  a  nice  way:  a  set  5  of  valid  nodes  is  consistent  iff 
each  pair  of  nodes  in  S  is  consistent. 

Some  straightforward  approaches  to  describing  consistency  actually  fail.  For  example: 

•  A  set  5  is  not  necessarily  consistent  if  the  USER  .PARTIAL.  ORDER- maxima  of  its  past 
form  a  vector.  (Figure  4.22  provides  a  counterexample.)  Using  ttmetrees  rather  than 
USER. PARTIAL. ORDER  Ordering  is  important  if  one  branch  of  a  tree  can  develop  a  depen¬ 
dence  on  another  branch. 
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•  A  set  5  is  not  necessarily  inconsistent  even  if  it  is  not  TiMETREES-dominated  by  a  set  of 
consistent  leaves.  (Figure  4.23  provides  a  counterexample.)  Thinking  about  computations 
as  arising  hrom  a  set  of  process  incarnations — maximal  root-leaf  branches — leads  to  this 
incorrect  description. 

•  While  consistency  of  a  set  follows  from  pairwise  consistency,  evaluating  whether  a  node  A 
at  process  p  is  consistent  with  a  node  B  at  process  q  still  requires  system-wide  data — nodes 
A  and  B  may  be  inconsistent  because  they  depend  on  concurrent  branches  of  the  timetree  at 
a  third  process  r.  (The  timestamp  vectors  provide  the  system-wide  data.) 

The  timestamp  pseudo-vector  of  a  valid  node  A  marks  the  lower  bound  of  the  events  concurrent 
and  consistent  with  A — the  adjusted  timestamp  vector  of  a  valid  node  is  the  minimal  consistent 
timeslice  containing  that  event.  The  asymmetry  of  time  in  USER  _  partial  _  ORDER  makes  it  difficult 
to  define  a  similar  “rollback  pseudo- vector”  having  a  similar  property. 

Generation  A  failure-free  virtual  computation  arises  from  a  user  .partial  .order  graph 
through  consistency.  A  set  of  consistent  nodes  in  the  USER  .PARTIAL.  ORDER  graph  determines 


S 

Figure  4.22  Even  if  the  user. partial .ORDER-maxima  of  the  past  of  a  set  forms 
a  vector,  the  set  itself  may  not  be  consistent.  This  graph  shows  a  counterexample: 
the  past  of  the  mutually  concurrent  vector  S  is  not  a  valid  prefix  of  a  failure-free 
virtual  computation. 


Rgure  4.23  Not  being  dominated  by  a  consistent  leaf  vector  does  not  imply  in¬ 
consistency.  This  graph  shows  a  counterexample:  the  vector  S  is  consistent,  even 
though  the  only  dominating  leaf-vector  S'  is  not  consistent. 


a  past-closed  prefix  of  a  failure  .free,  partial,  order  graph — the  nodes,  along  with  their  live 
histories.  When  two  consistent  sets  are  distinct  but  have  a  consistent  union,  then  these  sets  repre¬ 
sent  different  intermediate  versions  of  the  same  computation.  The  three  virtual  computations  we 
extracted  from  Figure  4.17  arise  from  the  three  maximal  distinct  consistent  sets. 

The  goal  of  optimistic  rollback  recovery  to  restore  consistency  to  the  system  computation:  to 
ensure  that  the  current  user  nodes  at  the  processes  form  a  consistent  set. 


4.3.  Asynchronous  Optimistic  Roiiback  Recovery  Using 
Distributed  Time 


The  distributed  time  framework  developed  in  this  thesis  provides  tools  for  reasoning  about  multiple 
levels  of  time  relations,  for  designing  protocols  in  terms  of  these  relations,  and  for  considering 
independently  the  inherent  security  and  privacy  risks.  Section  4.3  uses  this  framework  to  build 
a  new  optimistic  rollback  recovery  protocol.  The  heart  of  the  protocol  is  a  simple  procedure 
for  processes  to  determine  exactly  when  a  given  state  or  event  is  an  orphan.  The  design  and 
the  correctness  of  this  procedure  follow  directly  from  explicitly  tracking  both  the  partial  order  of 
causal  dependency  and  the  partial  order  of  rollback  knowledge.  This  procedure  is  complete  in  that 
it  reports  no  false  negatives.  It  thus  allows  completely  asynchronous  recovery  while  also  ensuring 
that  each  process  rolls  back  at  most  once  to  recover  from  any  failure — and  that  processes  that  do 
not  depend  on  the  failure  need  not  roll  back  at  all. 

This  protocol  thus  substantially  improves  on  previous  optimistic  rollback  recovery  protocols. 


Section  4.3.1  provides  an  overview  of  this  work.  Section  4.3.2  discusses  the  orphan  detection 
test.  Section  4.3.3  presents  the  complete  protocol.  Section  4.3.6  compares  our  protocol  to  previous 
work. 


4.3.1.  Overview 

Rollback  recovery  requires  determining  which  states  and  events  have  been  potentially  influenced 
by  lost  activity.  Many  existing  protocols  use  some  form  of  partial  order  time  (either  implicitly  or 
explicitly)  to  track  this  potential  dependence.  However,  by  dispensing  with  formal  coordination, 
asynchronous  rollback  recovery  also  requires  the  ability  to  reason  about  and  track  potential  knowl¬ 
edge  of  failures  and  restarts.  This  activity  itself  is  an  asynchronous  distributed  computation,  and 
is  thus  also  trackable  using  partial  order  time.  However,  this  partial  order  differs  from  the  partial 
order  of  events  visible  within  the  user’s  computation.  For  rollback  recovery,  potential  knowledge 
at  the  system  level  is  not  the  same  as  causal  dependency  at  the  user  level.  For  example,  suppose 
process  q  learns  that  its  current  state  A  depends  on  a  lost  state.  Process  q  rolls  back,  and  then  enters 
state  B.  Although  a  knowledge  path  exists  from  state  A  to  state  B,  no  causal  dependency  path 
exists. 

For  effective  implementation  of  asynchronous  recovery,  we  need  to  move  from  viewing  time 
as  a  linear  order  to  viewing  it  as  a  partial  order,  and  we  also  need  to  move  away  from  viewing  time 
as  a  single  level  of  abstraction.  The  framework  of  distributed  time  provides  these  tools,  and  allows 
us  to  build  a  new  protocol  that  cleanly  and  elegantly  solves  the  asynchronous  recovery  problem. 
Distributed  time  enables  us  to  deflne  when  a  state  can  be  known  to  depend  on  a  lost  state,  and  to 
implement  a  test  within  the  protocol  that  fully  utilizes  this  potential  knowledge. 


Advantages  Our  new  protocol  is  the  first  optimistic  rollback  protocol  to  implement  complete!  y 
asynchronous  recovery  effectively.  It  also  compares  favorably  in  many  other  aspects.  We  discuss 
some  of  the  advantages: 

•  Complete  Asynchrony  A  failed  process  can  restart  immediately.  When  a  process 
must  roll  back,  it  can  roll  back  immediately  and  resume  computation  without  additional 
synchronization  with  other  processes. 

•  Maximal  Recovery  Like  other  optimistic  rollback  protocols,  ours  guarantees  that  a  siate 
or  event  is  rolled  back  iff  it  causally  depends  on  the  computation  lost  at  failed  processes. 

•  Minimal  Rollbacks  Our  protocol  also  guarantees  that  a  failure  at  process  p  causes  a 
process  q  to  roll  back  at  most  once.  Processes  that  do  not  depend  on  the  failure  will  not  roll 
back  at  all. 

•  Speedy  Recovery  Suppose  process  q  must  roll  back  because  of  a  failure  at  process  p. 
Process  q  will  roll  back  as  soon  as  any  knowledge  path  is  established  from  p’s  rollback. 
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•  Concurrent  Recovery  Recovery  from  a  process  failure  occurs  as  information  about 
the  failure  propagates.  Basing  recovery  on  information  flow  rather  than  coordinated  rounds 
directly  allows  recovery  from  concurrent  failures  to  proceed  concurrently:  the  recoveries 
merge  and  the  protocol  restores  the  maximum  recoverable  system  state  [Jo89].  (In  particular, 
two  processes  that  each  need  to  roll  back  due  to  two  failures  do  not  need  to  react  to  the  failures 
in  the  same  order.) 

•  Toleration  of  Network  Partitions  Another  side-effect  of  our  asynchronous  approach  is 
that  once  initiated,  recovery  can  proceed  despite  a  partitioned  network.  The  only  processes 
that  need  to  worry  about  recovery  are  those  that  may  causally  depend  on  lost  states.  Since 
each  such  process  can  recovery  asynchronously,  the  processes  on  the  same  side  of  the  network 
as  the  failure  can  recover  immediately.  Processes  on  the  other  side  that  need  to  recover  can 
do  so  when  the  network  is  reunited.  The  remaining  processes  on  either  side  may  proceed 
unhindered.  (However,  this  work  does  not  address  the  problem  of  detecting  failure  in  a 
partitioned  network.) 

•  A  Framework  for  Security  and  Privacy  Tracking  partial  order  time  relations  creates 
security  and  privacy  risks,  since  processes  must  share  and  trust  private  information.  By 
building  our  protocol  in  terms  of  distributed  time,  we  can  provide  transparent  protection 
against  these  risks. 


Drawbacks  Our  new  protocol  does  require  timestamp  information  to  be  maintained,  since 
processes  must  track  relations  in  both  the  user  and  system  partial  orders.  Vector  clock  implemen¬ 
tations  for  these  models  require  one  entry  per  process.  For  SYSTEM -PARTIAL -ORDER,  these  entries 
can  be  a  pair  of  scalars.  A  straightforward  implementation  of  USER -PARTIAL -ORDER  clocks  would 
require  that  the  size  of  the  entry  for  process  p  be  proportional  to  the  number  of  rollbacks  process 
p  has  performed.  However,  optimizations  may  substantially  reduce  this  size.  For  example,  Strom 
and  Yemini  obtain  constant  size  entries  by  transmitting  the  ext-r  i  data  incrementally  (at  the  cost  of 
not  always  having  sufficient  data  to  make  a  comparison).  Using  similar  implementations  will  keep 
timestamp  size  in  our  protocol  within  a  factor  of  two  of  Strom  and  Yemini.  Section  4.3.4  considers 
these  issues  in  more  detail. 


4.3.2.  Orphan  Detection 

In  terms  of  our  time  models,  an  orphan  is  a  user  state  A  such  that  some  rolled-back  user  state 
B  exists  with  B  A  in  the  USER -PARTIAL -ORDER  model.  This  section  discusses  the  central 
roll  that  orphans  play  in  optimistic  rollback  recovery  in  general,  and  asynchronous  approaches  in 
particular.  This  section  then  uses  distributed  time  to  define  when  a  process  can  potentially  know 
that  a  state  is  an  orphan,  and  then  to  build  a  simple  test  that  achieves  this  potential. 
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Preliminaries  We  assume  that  processes  enforce  the  invariant  that  their  user  state  is  always 
valid,  according  to  the  definition  in  Section  4.2.7.  (Section  4.3.3  will  show  this  assumption  is 
easily  satisfied.)  We  also  assume  that  processes  only  restore  states  from  their  current  live  history. 

We  discuss  our  protocol  in  terms  of  distributed  time,  which  describes  computations  as  graphs. 
Consequently,  we  sometimes  informally  identify  a  node  in  a  computation  graph  with  the  state  or 
event  it  represents. 

Discussing  two  levels  of  time  sometimes  make  the  use  of  Roman  letters  for  node  names 
ambiguous.  For  example,  is  the  node  “A”  a  system-level  node  or  a  user-level  node?  Where  a 
simple  name  may  be  misleading,  we  adopt  the  convention  of  using  subscripted  Roman  letters;  e.g., 
“As"  will  be  a  system  node,  and  “Bu"  will  be  a  user  node.  We  adopt  a  similar  convention  for 
messages  and  vectors. 

Why  Orphan  Testing  is  Cruciai  Suppose  p  is  the  process  that  actually  failed.  The  system 
process  at  p  initiates  recovery  by  restoring  earlier  user  state  and  continuing  user-level  execution. 
This  action  causes  one  or  more  live  nodes  at  process  p  to  become  rolled-back.  These  rolled-back 
nodes  are  orphans  by  definition.  However,  the  rollback  action  at  p  may  also  cause  nodes  at  other 
processes  to  become  orphans. 

The  key  to  optimistic  rollback  recovery  is  the  ability  for  processes  to  know  when  nodes  have 
become  orphans.  This  has  two  aspects: 

•  Orphan  Elimination  When  process  q  learns  that  process  p  has  failed,  process  q  must 
determine  if  its  current  user  state  has  become  an  orphan.  If  so,  process  q  must  roll  back — 
preferably  back  to  the  most  recent  state  that  is  now  not  an  orphan.  Processes  thus  need  to 
be  able  to  test  if  their  own  user  nodes  are  orphans.  Figure  4.24  shows  a  detailed  example  of 
this  situation. 

•  Orphan  Prevention  The  rollback  at  process  p  may  have  caused  user  node  Av  at  some 
process  r  to  become  an  orphan.  However,  suppose  Au  was  the  send  of  a  message  to  process 
q.  If  the  user  process  at  q  accepts  the  message,  then  q  will  become  an  orphan.  Thus,  to 
prevent  their  current  user  nodes  from  becoming  orphans,  processes  need  to  be  able  to  test 
if  user  events  at  other  processes  are  orphans.  Figure  4.25  shows  a  detailed  example  of  this 
situation. 

Accurately  testing  for  orphans  is  especially  critical  for  asynchronous  recovery,  with  multiple 
failures  and  minimal  coordination. 

Knowledge  of  Orphans  Suppose  the  system  process  at  q  is  in  node  Bs.  When  could  q  know 
that  a  user  node  Au  is  an  orphan?  We  use  distributed  time  to  answer  this  question. 

In  order  to  test  Au,  the  system  process  at  q  must  be  aware  of  Au~  We  must  have  the  precondition 
that  for  soiree  system  node  As  in  SYSTEM{Au),  As  Bs- 
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failure 


Figure  4.24  Optimistic  rollback  recovery  raises  the  challenge  of  orphan  elimina¬ 
tion:  when  processes  learn  of  failure,  they  need  to  determine  their  most  recent 
node  that  is  not  an  orphan.  In  this  diagram,  all  named  nodes  are  user  nodes. 
Process  p  fails  and  restarts,  and  informs  process  r,  who  restores  a  copy  of  the 
state  at  C2  and  informs  process  q.  When  it  learns  of  process  r’s  rollback,  process 
q  must  decide  if  and  how  far  it  should  roll  back.  Process  q  depends  directly  on 
rolled  back  nodes  at  process  r,  so  a  naive  analysis  would  suggest  rolling  back  to 
before  8$.  In  actuality,  process  q  should  roll  back  to  before  since  that  node  ha-® 
a  direct  dependence  on  rolled-back  node  A4  at  process  p,  whose  failure  triggered 
the  rollback  at  process  r. 
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Figure  4.25  Optimistic  rollback  recovery  also  raises  the  challenge  of  orphan  pre¬ 
vention:  before  formally  receiving  an  arriving  user  message,  a  process  shouid 
determine  if  the  send  event  is  an  orphan.  Again,  all  named  nodes  are  user  nodes. 
Process  p  fails  and  rolls  back,  and  informs  process  r  who  rolls  back  and  restores 
a  copy  of  the  state  at  Cj.  Process  r  then  receives  user  message  M  from  process 
q.  The  send  event  of  M  is  an  orphan,  since  it  user-follows  from  a  rolled-back  node 
^4  at  process  p.  Accepting  this  message  would  cause  the  user  process  at  r  to 
become  an  orphan. 
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For  At/  to  be  an  orphan,  a  rolled-back  user  node  Cy  must  exist  with  Cy  Ay.  From 
Theorem  4.1  and  transitivity,  Cs  Bs  for  any  system  node  Cs  in  SYSTEM  {Cy).  A  path  of 
potential  information  flow  exists  from  the  rolled-back  Cy  to  the  system  process  at  q. 

However,  for  the  system  process  at  q  to  know  that  dependence  on  the  rolled-back  Cy  makes 
Ay  an  orphan,  p  must  know  that  that  Cy  has  been  rolled  back.  If  Ds  is  the  system  node  that  rolled 
back  Cy,  then  we  must  have  Ds  —  Bs  as  well. 


We  summarize  this  formally  with  the  predicate  ORPHAN{Ay,  Bs),  which  is  defined  only  when 
As  Bs  for  some  As  €  SYSTEM{Ay). 


ORPHAN{Ay,Bs)=true 

3  Cy,Ds  such  that 


1.  Cy  :=*  Ay  in  the  USER -PARTIAL -ORDER  model 

2.  Ds  Bs  in  the  system  -  partial  -  order  model 

3.  Ds  rolls  back  Cy 


The  ORPHAN  predicate  does  not  capture  all  the  orphans  in  the  computation — just  all  the  orphans 
that  a  given  system  process  may  potentially  know  are  orphans.  If  process  p  sends  process  q  a  user 
message  but  promptly  rolls  back  without  telling  anyone,  then  process  q  can  not  know  that  the  send 
is  an  orphan.  In  the  SYSTEM  .PARTIAL. ORDER  model,  the  timestamp  vector  on  a  node  Bs  marks 
the  information  horizon  of  that  node.  At  node  Bs,  the  system  process  cannot  know  about  anything 
beyond  this  horizon. 

An  Optimal  Orphan  Test  We  can  use  distributed  time  to  build  a  test  that  captures  the  ORPHAN 
predicate  exactly.  First,  we  build  a  test  that  lets  a  system  process  determine  if  (to  its  current 
potential  information)  a  node  has  been  rolled  back.  Then,  we  generalize  this  test  to  let  a  system 
process  determine  if  a  given  node  depends  on  a  node  that  has  been  rolled  back. 

Let  5s  be  a  system  node  at  process  9.  Let  5s  be  the  p  entry  of  Vsys(  5s),  and  let  5c'  =  USER(Es). 
At  Bs,  process  q  has  no  information  that  Ey  is  not  live.^  Any  5s  rolling  back  Ey  would  system- 
follow  5s;  if  q  could  know  about  such  an  5s,  then  5s  would  not  have  been  the  p  entry  in 
Vsys(5s). 

Further,  process  q  at  5s  can  know  of  a  user  node  Cy  at  process  p  iff  for  some  Cs  G  SYSTEM(  Ci  ), 
Cs  precedes  5s.  Process  q  at  5s  can  sort  these  user  nodes  at  p  into  two  groups: 


•  those  that  user-precede  Ey  (the  user  version  of  the  p  entry  of  ysys(  5s)),  and 

•  those  that  do  not. 


’By  definition,  node  Ey  is  live  iff  its  process  p  has  not  rolled  it  back.  A  live  node  may  be  an  orphan;  knowing  that 
a  node  is  live  is  not  the  same  as  knowing  that  it  is  not  an  orphan.  For  example,  process  s  may  have  rolled  back  an 
ancestor  Gy  of  Ey .  Process  q  may  perceive  that  p  has  not  rolled  back  Ey  but  s  has  rolled  back  Gy ,  and  consequently 
the  currently  live  node  at  p  is  an  orphan. 


95 


Process  q  at  Bs  knows  that  each  node  in  the  second  group  has  been  rolled  back.  Process  q  treats 
each  node  in  the  first  group  as  if  it  were  live,  since  q  has  no  information  otherwise.  For  example, 
suppose  Cu  is  a  user  node  that  q  knows  about  at  Bs-  Consider  the  two  cases: 


•  If  Cu  Eu,  then  either  Cy  has  not  been  rolled  back,  or  information  about  this  rollback 
(which  would  also  roll  back  Eu)  has  not  reached  process  q  at  state  Bs- 

•  If  Cu  —h  Eu,  then  by  Es,  the  system  process  at  p  has  rolled  back  Cu-  This  rollback  event 
must  precede  or  equal  Es  and  thus  Bs,  so  process  q  knows  about  it. 


Figure  4.26  sketches  this  scenario. 

This  reasoning  shows  how  the  system  process  at  q  can  determine  if  a  specified  node  has  been 
rolled  back  (according  to  the  information  potentially  available  to  q).  Since  an  orphan  is  a  node 
that  depends  on  a  rolled-back  node,  this  reasoning  extends  to  allow  q  to  test  for  orphans.  Let  Au 
be  a  user  state  at  process  r  that  process  q  knows  about  at  Bs-  Let  p  be  an  arbitrary  process.  Let 
Es  be  the  p  entry  of  Vsys(5s),  and  let  Eu  =  USER{Es).  Let  Cu  be  the  user-maximal  user  state  at 
process  p  with  Cu  A.u, 


Au  Bj 


y.ys(Es)  ^sys(Fs) 

Figure  4.26  The  system  .partial  .order  timestamp  vector  of  a  system  process 
determines  what  states  it  can  know  to  have  been  rolled  back.  Here,  system  process 
q  at  Es  knows  about  user  nodes  Au  and  Bu  at  process  p,  since  they  lie  within  the 
system-horizon  of  Es-  (That  is,  paths  exists  from  all  nodes  in  SYSTEM{Au)  and 
SYSTEM{Bu)  to  Es-)  At  Es,  process  q  also  knows  that  Bu  has  been  rolled  back, 
since  the  rollback  event  Cs  also  lies  within  the  system-horizon  of  Es-  However,  at 
Es  process  q  believes  Au  is  still  live,  since  the  rollback  event  Ds  that  undoes  it  lies 
beyond  Vsys( Es) — and  thus  outside  the  knowledge  of  Es-  Process  q  does  not  learn 
that  As  has  been  rolled  back  until  Fs- 
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If  Cv  Eu,  then  process  q  at  Bs  perceives  no  rollback  at  p  that  makes  A  an  orphan.  If  this 
relation  holds  for  all  processes  p,  than  process  q  cannot  perceive  that  Ay  is  an  orphan;  according  to 
q's  potential  information  at  Bs,  nothing  that  Ay  depends  on  has  been  rolled  back.  If  this  relation 
fails  for  any  process,  then  process  q  knows  that  Ay  is  an  orphan. 

Vector  clocks  permit  an  elegant  statement  of  this  test.  For  a  vector  Ws  of  system  nodes, 
USER-VECTOR{Ws)  is  the  vector  of  user  nodes  obtained  by  applying  USER  to  each  entry.  Let 
Xusr  denote  the  vector  precedence  relation  under  user  .partial  .order;  vectors  Uy  and  Vy  satisfy 
Uy  ^usr  Vu  when  for  each  p,  the  p  entry  of  Uy  precedes  or  equals  the  p  entry  of  Vy  in  TIMETREES. 

Define  DT. ORPHAN.  TEST  by  the  following  comparison: 

DT.ORPHAN.TEST{Ay,Bs)  =  true  \^^,{Ay)  USER.VECTORi\,y,{Bs)) 

That  is,  take  the  system  timestamp  of  Bs,  map  each  entry  to  its  user  equivalent,  and  do  a  TIMETREES 
vector  comparison  with  the  user  timestamp  of  Ay. 

This  test  captures  all  potential  knowledge  of  orphans. 

Theorem  4.4  If  system  node  Bs  and  valid  user  node  Ay  satisfy  i4s  Bs  for 
some  As  €  SYSTEM{Au),  then  they  satisfy  the  statement: 

ORPHAN{Ay,Bs)  <=»  DT. ORPHAN. TEST{ Ay,  Bs) 


Proof  Let  Ay  occur  at  process  p  and  Bs  at  process  q. 

Suppose  ORPHAN{Ay,Bs)  holds.  Then  at  some  process  r,  there  exists  a  user  node  Cy  and 
system  node  Ds  satisfying  the  following  statements: 

1.  Cy  Ay, 

2.  Ds  rolls  back  Cy,  and 

3.  Ds  Bs- 

Let  Ey  be  the  r  entry  of  Vusti Ay)  (which  exists,  since  Ay  is  valid).  Let  Fs  be  the  r  entry  of 
Vsys(5s),  and  let  Fy  =  USER{Fs).  By  (1)  and  the  definition  of  timestamp  vector,  Cy  =*  Ey. 
Hence,  Ey  rzl  Fy  would  imply  Cy  rzrt  Fy.  By  (2),  Cy  cannot  precede  or  equal  the  user  version 
of  any  system  node  Gs  at  r  with  Ds  Gs  (since  rolled-back  nodes  stay  rolled  back).  By  (3) 
and  the  definition  of  timestamp  vector,  Ds  Fs.  Thus  Cy  cannot  precede  or  equal  Fy,  so  Ey 
cannot  precede  or  equal  Fy.  TTius  DT. ORPHAN. TEST{  Ay,  Bs)  holds. 

Conversely,  suppose  DT. ORPHAN. TEST{ Ay,  Bs)  holds.  Then  there  exists  a  process  r  such 
that  Cy  fails  to  precede  or  equal  Fy,  where  Cy  is  the  r  entry  of  Vusr(^i/),  Fs  is  the  r  entry 
of  Vsys(fis).  and  Fy  =  USER{Fs).  Let  Cs  be  the  minimal  element  of  SYSTEM{Cy).  Since  the 
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definition  of  timestamp  vector  provides  Cu  Ay,  Theorem  4.1  provides  Cs  As  for  any 
As  €  SYSTEM{Au).  Hypothesis  then  provides  Cs  Bs-  The  definition  of  timestamp  vector 
then  provides  that  Cs  Fs.  Since  Cy  neither  precedes  nor  equals  Fy,  there  must  exist  a  Ds 
at  r  in  the  range  Cs  — ►  Ds  Fs  such  that  Ds  rolls  back  Cy.  By  the  definition  of  timestamp 

vector,  Ds  Bs.  Hence  ORPHAN(Ay,  Bs)  holds.  □ 

Figure  4.27  shows  how  DT. ORPHAN. TEST  resolves  the  orphan  elimination  problem  from 
Figure  4.24.  Figure  4.28  shows  how  the  test  resolves  the  orphan  prevention  problem  from 
Figure  4.25. 


4.3.3.  The  Protocol 

We  build  our  protocol  for  optimistic  rollback  recovery  by  having  the  system  processes  maintain 
vector  clocks  for  the  user  and  system  partial  orders,  and  then  using  these  clocks  to  test  for  orphans. 


failure 


A^  A2  A^  A  \  /tfil  Ay 


YUB3) 


Figure  4.27  DT.  ORPHAN.  TEST  allows  accurate  orphan  elimination.  Here,  node 
Bs  is  the  only  named  system  node.  At  Bs,  the  system  process  at  q  can  know  that 
user  node  53  is  an  orphan,  because  in  the  process  p  timetree,  node  A^  (the  p  entry 
of  V„sr(B3))  does  not  precede  node  A^  (the  user  image  of  the  p  entry  of  Vsys(5s)). 
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V^Bs)  ''sys(Cs) 


Rgure  4.28  DT. ORPHAN. TEST  allows  accurate  orphan  prevention.  Here,  node 
Cs  is  the  only  named  system  node.  At  Cs,  the  system  process  at  r  can  know  that 
the  send  Bs  of  user  message  M  is  an  orphan,  because  in  the  process  p  timetree, 
node  Aa  (the  p  entry  of  YusriBs))  does  not  precede  node  Af,  (the  user  image  of  the 
p  entry  of  Vsys(Cs)). 
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/*  the  orphan  test  */ 

function  DT. ORPHAN. TEST{TESTED STATE,  TESTING. STATE) 

V"*- Vusr(  TESTED. STATE) 

USER. VECTOR{\^ys{TESTING. STATE)) 
return  ^COMPARE{V,  W,  (user .PARTIAL- ORDER, timetrees)) 

/*  process  p  receives  system  message  M  */ 
procedure  RECEIVE{Ms) 

/*  set  pointers  to  current  nodes  */ 

As=CUR.NODE{p,  SYSTEM  .PARTIAL  .ORDER) 

Au=CUR.NODE{p,  USER . PARTIAL . ORDER) 

/*  update  SYSTEM -PARTIAL -ORDER  vector*/ 

Ss*-SEND.EVENT{Ms,  SYSTEM. PARTIAL. ORDER) 

Vsys( i45)<-AM;f(Vsys(  As),  Vsys(5s),  (SYSTEM  .partial .ORDER,  SYSTEM  .TIMELINES ) ) 

/*  is  p  now  an  orphan?  */ 
if  DT. ORPHAN. TEST{Au,  As) 

then  roll  back  to  maximal  non-orphan  state 

/*  was  sender’s  current  user  node  an  orphan?  */ 
if  DT. ORPHAN. TEST{USER{Ss),  Bs) 
then  optionally  inform  the  sender 

/*  did  Ms  include  a  user  message?  */ 
if  USER.MESSAGE.TEST{Ms)  then 

Su  *-SEND.EVENT{  USER.MESSAGE{  Ms ) ,  USER  .  PARTIAL  .  ORDER) 

/*  accept  it  if  the  user  send  is  not  an  orphan  */ 
if  DT. ORPHAN. TEST{Su,  Bs) 

then  optionally  inform  the  sender 
else  accept  USER.MESSAGE{Ms) 


Figure  4.29  In  the  distributed  time  protocol,  a  system  process  rolls  itself  back  if 
its  user  state  has  become  an  orphan,  and  then  accepts  a  user  message  only  if  its 
send  is  not  an  orphan.  (We  use  =  to  indicate  assignment  by  reference,  and  ^  to 
indicate  assignment  by  value.) 
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Sending  a  User  Message  Suppose  the  user  process  at  p  decides  to  send  a  message  Mi  to 
the  user  process  at  q.  The  user  process  p  packages  Mu  along  with  \nst{Au)  (where  Ay  is  the  user 
send  event)  and  routes  it  to  the  system  process  at  p,  who  sends  the  package  as  a  system  message. 


Sending  a  System  Message  When  the  system  process  at  p  sends  a  system  message  Ms  to 
the  system  process  at  q,  it  sends  along  the  timestamp  Vsys(Bs)  (where  Bs  the  system  send  event). 
The  system  process  at  p  may  optionally  include  the  user  timestamp  vector  of  USER{Bs). 


Receiving  Meesages  Figure  4.29  shows  the  procedure  used  for  receiving  messages.  Suppose 
the  system  process  at  p  receives  a  system  message  Ms  sent  by  the  system  process  at  q.  The  system 
process  at  p  updates  its  current  Vsy*  vector.  If  DT. ORPHAN.  TEST  indicates  that  p’s  current  user 
node  is  an  orphan,  the  system  process  at  p  performs  rollback.  If  DT. ORPHAN. TEST  indicates 
that  the  user  node  corresponding  to  the  send  of  Ms  is  an  orphan,  the  the  system  process  at  p  may 
optionally  inform  q.  If  Ms  contains  a  user  message  My,  then  the  system  process  at  p  applies 
DT. ORPHAN. TEST  to  the  send  of  My.  If  this  event  is  an  orphan,  p  may  optionally  inform  q\  if 
not,  the  system  process  at  p  lets  the  user  process  at  p  receive  the  message. 

(Suppose  the  send  of  a  user  message  My  user-followed  from  node  Ay  at  process  r,  but  process 
p’s  current  user  node  depends  on  By  at  r,  with  Ay  and  By  concurrent  in  the  timetree.  At  least  one 
of  Ay,  By  must  have  been  rolled  back,  and  the  system  timestamp  on  My  will  carry  that  information 
if  the  system  process  at  p  does  not  already  know  it.  Thus,  this  protocol  automatically  enforces  the 
invariant  that  user  states  are  always  valid.) 


Rollback  A  process  rolls  back  in  two  situations:  when  it  fails,  and  when  it  discovers  it  is  an 
orphan. 

To  roll  back  because  of  its  own  failure,  a  process  restores  the  maximal  recoverable  state  in  its 
live  history. 

To  roll  back  because  it  discovers  it  is  an  orphan,  a  process  must  find  a  state  in  its  live  history  that 
is  not  an  orphan — that  is,  a  state  whose  Vusr  timestamp  still  user-precedes  the  current  Vsy*  vector. 
Clearly  the  initial  state  is  not  an  orphan,  and  clearly  once  a  user  state  is  an  orphan,  subsequent 
user-states  are  orphans.  Thus,  for  a  given  value  of  Vsy*,  there  exists  a  unique  user-maximal  state  in 
the  live  history  that  is  not  an  orphan. 

How  quickly  the  system  recovers  from  rollback  depends  on  how  quickly  the  processes  that  are 
(or  may  become)  orphans  learn  of  the  rollback.  Our  protocol  allows  a  range  of  alternatives,  from 
broadcasting  system-only  messages,  to  letting  the  news  percolate  via  the  system  timestamp  data 
on  user  messages. 
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4.3.4.  Implementation  Details 

Implementing  this  protocol  requires  solving  a  number  of  problems: 


•  Building  vector  clocks  for  SYSTEM  .partial .order  requires  the  ability  to  sort  system  states 
in  terms  of  system  timelines. 

•  Building  vector  clocks  for  user. partial. order  requires  the  ability  to  sort  user  spates  in 
terms  of  user  timetrees. 

•  Performing  DT. ORPHAN.  TEST  requires  the  ability  to  map  system  states  to  user  states  (that 
is,  to  perform  USER). 

This  section  provides  one  possible  solution.  We  build  a  SYS. NAME  data  structure  for  each  system 
node,  a  VSR. NAME  data  structun;  for  each  user  node,  and  show  how  to  perform  the  above  functions 
in  terms  of  these  data  structures. 


The  System  Timeline  System  states  at  a  process  occur  in  consecutive  order,  so  a  simple  scalar 
counter  will  suffice.  The  only  complication  arises  because  failed  processes  will  not  know  how  high 
their  system  state  counter  was  before  failure. 

Consequently,  we  have  each  process  maintain  two  counters:  INCARNATION  .COUNT  tracks 
the  current  incarnation  of  the  process  [StYe85],  and  SYS. COUNT  tracks  the  current  node  within 
that  incarnation.  Startup  initializes  each  counter  to  zero.  The  SYS.  COUNT  counter  is  incremented 
with  each  subsequent  system  node,  unless  the  subsequent  node  is  a  rollback  node,  in  which  case 
SYS. COUNT  resets  to  zero,  and  INCARNATION  .COUNT  is  incremented. 

The  SYS. NAME  for  a  node  consists  of  three  items:  the  INCARNATION  .COUNT  value,  the 
SYS. COUNT  value,  and  the  USR.NAME  of  the  node’s  current  user  state.  To  sort  two  system  nodes 
at  the  same  process,  we  perform  lexicographic  comparison  of  the 

[INCARNATION.  COUNT,  SYS.  COUNT) 
pairs.  To  implement  USER,  we  return  the  USR.NAME  entry. 


The  User  Timetree  Comparing  nodes  in  user  timetrees  is  more  challenging  than  comparing 
nodes  in  system  timelines,  because  trees  do  not  guarantee  that  two  nodes  can  even  be  ordered.  The 
restriction  that  we  must  generate  USR.NAME  values  on-line  further  complicates  matters. 

We  can  begin  by  having  each  process  maintain  a  USR.  COUNT  variable,  initially  zero,  indicating 
the  count  of  the  current  user  node  in  the  currently  live  history.  The  process  increments  USR.  COUNT 
with  each  subsequent  user  node — except  one  obtained  through  rollback. 
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The  USR. COUNT  values  suffice  to  sort  two  user  nodes  within  the  same  live  history.  However, 
we  need  to  be  able  to  determine  if  two  nodes  are  in  the  same  history — that  is,  if  one  is  a  descendent 
of  the  other  in  the  user  timetree. 

Conceptually,  processes  could  track  this  informtuion  by  maintaining  the  path  from  the  root  to 
the  current  node  in  the  user  timetree.  Label  the  nodes  in  the  user  timetree  with  their  USR.  COUNT 
value,  and  the  edge  into  each  node  A  with  the  INCARNATION  .COUNT  value  active  when  that  node 
was  executed.  The  INCARNATION  .COUNT  value  is  fixed  until  rollback  occurs — then  we  create  a 
node  for  the  new  instance  of  restored  state,  and  add  it  as  a  sibling  of  its  earlier  instance.  The  edge 
from  its  parent  to  the  new  node  is  labeled  with  the  new  INCARNATION.  COUNT  value. 

The  path  for  node  A  is  just  the  sequence  of  pairs  of  node  and  edge  labels 

iNo,Eo),...,iNk.uEk.i) 

necessary  to  reach  A  from  the  root.  Figure  4.30  shows  the  labelling  on  a  timetree  for  a  computation 
that  rolls  back  twice. 

We  make  a  couple  of  observations: 

•  Paths  are  sufficient  to  sort  nodes.  If  A  and  B  are  two  nodes  in  the  user  timetree,  A  precedes 
B  iff  the  path  for  A  is  a  prefix  of  the  path  for  B. 

•  Paths  can  be  greatly  condensed.  The  ith  node  label  in  a  path  is  the  integer  i  -  1.  The  ith 
edge  label  in  a  path  is  the  same  as  the  label  on  edge  i  —  1 ,  unless  edge  i  leads  to  a  node 
restored  by  rollback.  Unless  the  computation  has  rolled  back  to  initial  conditions,  all  paths 
start  with  (0, 0). 

Let  A  have  USR.  COUNT  =  k.  Then  the  path  from  the  root  to  A  has  the  form 

(0,£'o),— ,(^  -  l,£^fe-i) 

We  condense  this  path  by  deleting  (z,  Ei)  if  Ei  =  £’,_i,  and  deleting  the  leading  pair  if  it  is  (0. 0). 
The  USR. NAME  of  A  consists  of  the  USR. COUNT  value,  and  this  condensed  path.  (Figure  4.30 
shows  this  construction.) 

To  compare  two  user  nodes  in  the  timetree  at  a  process,  we  check  whether  one  node’s  path  is 
a  prefix  of  the  other.  Suppose  fc,  P  and  k',  P'  are  the  USR. NAME  values  for  nodes  A  and  A',  and 
k  <  k'.  We  determine  if  A  — >  A'  by  removing  from  P'  any  pairs  (m,  Em)  with  m  >  k,  and  then 
checking  if  the  resulting  list  is  identical  to  P. 

The  length  of  the  condensed  path  in  the  USR. NAME  for  node  A  is  proportional  to  the  number 
of  rollbacks  in  the  path  from  the  root  to  A.  If  failures  occur,  this  will  not  be  constant;  thus,  for  the 
implementation  we  sketched  above,  USER  .partial -ORDER  timestamp  vectors  will  not  be  linear. 
The  size  instead  will  be  proportional  to  the  number  of  rollbacks  in  the  system  .partial,  order 
past  of  the  nodes  in  the  vector. 
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Rgure  4.30  Path  information  allows  sorting  in  user  timetrees.  In  this  sample 
tree,  we  have  labeled  nodes  with  their  usr.count  value  and  edges  with  their 
INCARNATION. COUNT  value.  The  USR.NAME  for  A  is  4, 0;  for  S  is  3,  {( 1, 1)},  and 
for  C  is  8,  {( 1, 1),  (5, 2)}.  Wo  can  determine  that  A  does  not  precede  C,  since  the 
condensed  path  for  A  does  not  equal  {(1,1)}  (the  condensed  path  for  C  trimmed 
for  A).  We  can  determine  that  B  precedes  C,  since  the  condensed  path  for  B  does 
equal  the  condensed  path  for  C,  trimmed  for  B. 


We  can  reduce  the  amortized  length  of  USR.NAME  values  by  having  processes  try  to  avoid 
transmitting  redundant  data.  Suppose  process  p  wants  to  send  the  USR.NAME  of  A  to  process  q. 
Instead  of  sending  the  path  from  the  root  to  A,  process  p  can  send  the  path  from  an  intermediate 
node  B  to  A.  If  process  q  already  knows  the  path  from  the  root  to  B,  then  process  q  quickly 
reconstructs  the  full  path.  If  not,  process  q  recognizes  that  it  is  missing  data  and  blocks  until  it  can 
obtain  it. 

One  example  of  this  amortization  technique  is  using  a  heuristic  similar  to  Strom  and  Yemini’s 
approach.  Each  time  a  process  rolls  back,  it  broadcasts  the  path  to  that  rollback  node  along  with  its 
new  incarnation  count.  Subsequent  USR.NAME  values  consist  solely  of  INCARNATION  .COUNT 
and  USR.COUNT.  (This  heuristic  introduces  blocking  into  our  protocol,  but  still  maintains  the 
at-most-once  lower  bound  on  rollbacks  at  a  process.)  However,  a  wide  range  of  other  heuristics 
exists  for  this  technique.  At  one  extreme,  process  p  transmits  only  the  end  of  the  path;  at  the  other 
extreme,  process  p  maintains  the  most  recent  system  timestamp  vector  received  from  q,  and  uses 
the  q  entry  as  the  intermediate  node  for  a  name  sent  to  q. 

Commitment  and  garbage  collection  may  integrate  nicely  with  these  amortization  techniques, 
since  processes  may  maintain  a  log  vector  of  the  maximal  known  logged  nodes  at  other  processes. 
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4.3.5.  Piecewise  Determinism  and  State  intervais 


The  presentation  of  our  protocol  allowed  transitions  between  states  to  be  nondeterministic.  As 
Section  4.1.5  observed,  providing  complete  recoverability  under  this  model  requires  logging  every 
transition.  However,  realistic  distributed  systems  often  guarantee  that  process  execution  is  deter¬ 
ministic  between  message  receives;  models  for  these  systems  focus  on  state  intervals  instead  of 
states.  With  a  few  modifications,  our  protocols  adapt  to  this  environment. 

The  framework  of  Section  2.2. 1  can  express  piecewise  determinism  by  restricting  processes  to 
blocking  receives — that  is,  if  a  process  attempts  to  receive  a  message,  it  pauses  until  a  message 
is  available.  (This  approach  contrasts  with  more  flexible  polling  or  interrupting  approaches.)  We 
require  all  other  process  transitions  to  be  deterministic.  A  state  interval  is  the  period  of  deterministic 
execution  between  successive  receive  events.  We  can  build  a  simple  time  model  to  collect  the 
sequence  of  nodes  between  successive  receives  into  a  state  interval  node. 

The  coarser  granularity  of  state  intervals  makes  logging  and  replay  easier.  However,  this 
granularity  also  changes  how  rollback  should  affect  the  mapping  between  the  system  and  user  time 
models.  In  the  state  model,  when  a  process  restores  state  Au,  it  establishes  a  sibling  of  Au  in  the 
user  timetree.  However,  this  approach  does  not  work  for  state  intervals,  since  the  state  at  other 
processes  may  depend  directly  on  Au,  rather  than  indirectly  through  a  subsequent  state  transition 
node.  Restoring  a  sibling  of  Au  incorrectly  makes  the  state  at  these  processes  orphans. 

The  solution  to  this  problem  is  for  rollback  to  restore  state  interval  Au  itself,  not  a  sibling.  The 
interval  following  the  re-execution  of  Au  begins  the  new  timetree  branch.  Consequently,  the  set  of 
system  interval  nodes  that  a  user  interval  node  represents  is  not  necessarily  a  connected  sequence. 
This  fact  has  some  implications  for  Section  4.2.5.  Our  graphical  shorthand  no  longer  applies,  since 
Theorem  4. 1  no  longer  holds;  however,  we  can  still  establish  a  weaker  version  of  Theorem  4. 1 . 


Theorem  4.5  Suppose  Au  and  Bu  are  user  state  intervals.  Let  Bs  be  any  system 
state  interval  corresponding  to  Bu,  but  let  As  be  the  minimal  system  state  interval 
corresponding  to  Au-  If  Au  Bu  then  As  Bs- 


Proof  This  result  follows  from  induction  on  precedence  paths.  If  Au  and  Bu  occur  at  the  same 
process,  then  the  result  easily  follows.  If  Au  sends  a  message  that  begins  Bu,  then  some  system 
state  interval  following  or  equaling  As  must  also  precede  Bs- 

For  more  general  paths,  choose  an  intermediate  node  Cu  with  Au  — ►  Cu  — >  Bu,  and  let  Cs 
be  the  minimal  system  state  node  corresponding  to  Cu-  Establish  the  result  for  Au  and  Cu,  and 
then  for  Cu  and  Bu-  □ 

Theorem  4.4  holds  for  state  intervals,  since  we  may  substitute  Theorem  4.5  for  Theorem  4. 1  in 
its  proof. 
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4.3.6.  Comparison  to  Related  Work 


Checkpointing  As  we  note  in  Section  4. 1 .3.  recovery  protocols  based  on  checkpointing  restore 
the  system  to  a  recovery  line  composed  of  local  checkpoints.  Organizing  recovery  lines  into  an 
increasing  sequence  (e.g.,  [BCS84,  Ci89])  may  allow  asynchronous  recovery  and  may  tolerate 
concurrent  failures  (since  one  recovery  line  will  clearly  be  earliest).  More  complex  structures 
of  recovery  lines  require  more  synchronization  upon  recovery,  but  may  allow  some  surviving 
processes  to  proceed  without  roiling  back.  However,  unless  every  adjusted  rollback  vector  is  a 
recovery  line,  checkpointing-based  recovery  will  force  surviving  processes  to  roll  back  computation 
that  does  not  depend  on  the  computation  lost  due  to  failure 

The  distinction  between  checkpointing-based  protocols  and  the  message  logging  family  some¬ 
times  blurs.  (Johnson  [Jo931  presents  a  protocol  that  is  explicitly  hybrid.)  A  checkpointing  scheme 
in  which  processes  checkpoint  every  local  state  to  stable  storage  before  proceeding  would  be  similar 
to  pessimistic  rollback;  a  checkpointing  scheme  in  which  processes  checkpoint  every  local  state 
to  volatile  storage  (and  eventually  to  stable  storage)  would  be  similar  to  optimistic  rollback.  Our 
protocol  adapts  to  this  latter  environment. 

Ciuffoletti  [Ci84]  proposed  a  checkpointing  protocol  for  synchronous  communication:  with 
each  message,  processes  use  a  heavy-weight  scheme  to  exchange  history  and  checkpoint  infor¬ 
mation  between  sender  and  receiver.  Although  some  aspects  of  this  scheme  foreshadow  the  user 
and  system  levels  in  our  work,  Ciuffoletti’s  protocol  is  inherently  synchronous,  and  the  model  of 
synchronous  communication  does  not  apply  to  realistic  distributed  systems. 


Optimistic  Roilback  Recovery  Strom  and  Yemini  [StYe85]  initiated  the  area  of  optimistic 
rollback  recovery.  They  presented  optimistic  techniques  for  surviving  processes  to  ensure  complete 
recoverability,  and  a  rollback  protocol^  that  allows  processes  to  recover  mostly  asynchronously, 
although  delayed  transmission  of  incarnation  start  information  may  cause  blocking.  This  protocol 
implicitly  uses  partial  order  time  to  track  dependency  on  failed  computation  (and,  to  our  knowledge, 
is  the  the  earliest  publication  of  the  timestamp  vector  mechanism). 

However,  Strom  and  Yemini  did  not  consider  the  flow  of  knowledge  of  rollback.  They  conse¬ 
quently  built  an  orphan  test  that  is  strictly  weaker  than  ours.  Their  protocol  never  falsely  concludes 
that  a  non-orphan  state  is  an  orphan.  However,  their  protocol  will  falsely  conclude  that  some 
orphan  states  are  not  orphans — even  when  the  testing  process  could  potentially  know  otherwise. 
These  false  negatives  make  it  possible  for  a  single  failure  at  one  process  to  cause  another  process  to 
roll  back  an  exponential  number  of  times,  since  the  unfortunate  process  never  rolls  back  far  enough 
(until  the  last  time).  Sistia  and  Welch  [SiWe89]  claim  an  0(2")  upper  bound  for  the  worst  case  in 
the  Strom  and  Yemini  protocol.  We  prove  an  (7(2")  lower  bound  by  construction  in  Figures  4.31 
through  4.33. 


*In  some  sense.  Merlin  and  Randell  [MeRa78]  foreshadowed  Strom  and  Yemini’s  work  by  presenting  a  protocol  based 
on  a  representation  similar  to  Petri  Nets;  this  protocol  could  be  transformed  and  optimized  into  one  similar  to  Strom 
and  Yemini’s. 
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Johnson  and  Zwaenepoel  [Jo89,  JoZw90]  developed  a  general  model  for  optimistic  rollback 
recovery.  They  used  state  lattices  from  partial  order  time  to  show  that  a  maximal  recoverable 
system  state  exists,  and  presented  synchronized  protocols  to  recover  this  state — even  without 
reliable  message  delivery.  Sistla  and  Welch  [Se891  presented  two  protocols  for  optimistic  recovery 
that  avoid  the  exponential  worst  case  by  using  synchronization  during  recovery;  like  Strom  and 
Yemini,  Sistla  and  Welch  require  reliable  FIFO  message  channels.  Peterson  and  Kearns  [PeKe93] 
recently  presented  a  recovery  protocol  using  vector  clocks  that  synchronizes  during  recovery  by 
passing  tokens. 


Summary  Optimistic  rollback  protocols  improve  on  other  recovery  methods  by  requiring  lit¬ 
tle  synchronization  during  failure-free  operation  and  by  requiring  only  the  theoretical  minimum 
amount  of  computation  to  be  rolled  back  (only  the  computation  that  depends  on  the  computation 
lost  due  to  failure).  Our  protocol  improves  on  previous  optimistic  rollback  protocols  by  providing 
both  completely  asynchronous  recovery  and  a  worst-case  upper  bound  of  at  most  one  rollback  at 
each  process.  The  key  to  asynchronous  optimistic  rollback  recovery  is  the  realization  that  two 
levels  of  partial  order  time  abstraction  are  relevant:  causal  dependency  on  rolled-back  events  and 
potential  knowledge  of  rollbacks.  Our  distributed  time  framework  allows  us  to  explicitly  track 
these  two  levels  of  time.  We  improve  even  on  the  explicit  “vector  time”  work  of  Peterson  and 
Kearns  by  truly  using  the  full  power  of  temporal  abstraction. 


Figure  4.31  The  failure  of  one  process  may  lead  to  n(2")  rollbacks  using  Strom 
and  Yemini’s  protocol.  This  diagram  shows  how  to  construct  computations  exhibit¬ 
ing  this  behavior.  We  build  this  computation  inductively.  This  diagram  shows  the 
hypothesis:  the  existence  of  a  computation  C„  on  n  processes  that  accepts  n  user 
messages  Mi,  then  n  system  messages  /2i,  announcing  the  rollback 
of  the  send  events  of  the  user  messages.  We  assume  that  the  single  failure  at 
process  po  triggers  2”  -  1  failures  in  computation  C„,  and  that  2"“'  of  these  failures 
occur  at  process  p„.  Figure  4.32  shows  how  to  build  computation  C„+i  from  two 
copies  of  computation  Cn.  Figure  4.33  shows  the  base  for  n  =  1 . 
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Figure  4.32  This  diagram  shows  how-to  build  Cn+i  from  two  copies  of  Cn- 
Computation  C„+i  receives  n  +  1  user  messages  M],  then  receives  n  +  1 

system  messages  Ri,...,Rn+i  announcing  the  rollback  of  the  user  send  events. 
Process  pi  receives  M\,  establishing  a  dependency,  then  sends  n  messages 
M,', M'  to  the  first  copy  of  C„.  This  establishes  dependency  on  pi ,  and  transitive 
dependency  on  the  send  event  of  Mi .  Process  pi  then  receives  rollback  announce¬ 
ment  Ri ,  rolls  back,  and  sends  the  announcements  out  to  the  first  copy  of  C„.  The 
n  remaining  user  messages  M2, Mn+i  are  then  fed  directly  to  the  second  copy  of 
Cn,  followed  by  the  remaining  rollback  announcements.  (We  cannot  repeat  the  pi 
trick  since  pi  now  knows  about  the  initial  failure.  However,  processes  p2  through  p„ 
only  know  about  the  failure  at  pi.)  The  assumption  that  C„  rolls  back  2"  -  1  times 
puts  the  number  of  rollbacks  in  Cn+i  at  2(2”  I)  -I-  1  =  -  1.  The  assumption 

that  the  last  process  in  Cn  rolls  back  2”"'  times  gives  2(2"~' )  =  2"  rollbacks  at  pn+\ 
in  Cn+i.  Hence  this  construction  establishes  the  induction. 
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Figure  4.33  This  diagram  shows  the  construction  of  C\ ,  the 
base  for  the  inductive  construction  of  Figure  4.32. 


4.4.  A  General  Framework 

From  a  high  level,  a  rollback  protocol  consists  of  the  initiating  process  requesting  that  the  system 
roll  back  to  some  point,  and  each  of  the  other  processes  receiving  this  request  and  cooperating. 
This  sketch  raises  some  questions: 


•  How  does  the  initiating  process  specify  the  state  to  be  restored? 

•  How  should  the  other  processes  react? 


The  protocol  in  Section  4.3  (as  well  as  most  of  those  in  the  literature)  uses  the  current  past- 
current  past  (CP-CP)  paradigm:  the  initiator  chooses  a  state  that  it  currently  regards  as  being  in 
its  past  (that  is,  a  state  USER. PARTIAL .ORDER-preceding  the  decision  to  roll  back),  and  the  other 
processes  each  choose  a  state  from  their  current  pasts. 

The  CP-CP  paradigm  has  the  advantage  of  being  well-defined.  Suppose  the  system  is  c  urrently 
consistent,  and  that  the  initiating  process  restores  state  A.  Then  the  adjusted  rollback  vector  R'  ( .4 ) 
(from  USER -PARTIAL -ORDER)  will  be  consistent  and  concurrent  with  restored  A  (and  subsequent 
computation).  When  recovery  is  complete,  the  virtual  FAILURE -FREE -Partial -Order  computa¬ 
tion  will  consist  of  the  portion  of  the  initial  failed  computation  preceding  R“(/l),  with  the  revised 
computation  appended  from  there. 

Phrasing  rollback  in  terms  of  the  CP-CP  paradigm  immediately  suggests  alternative  paradigms: 
letting  the  initiator  and/or  the  other  processes  choose  from  their  general  pasts.  Allowing  the 
initiating  process  to  restore  any  state  from  its  timetree  permits  the  flexibility  of  rolling  back  rollback. 
(The  implementation  sketch  of  Section  4.3.4  allows  sorting  of  events  in  user-trees  even  if  the  trees 
grow  through  general-past  rollback.)  We  can  construct  scenarios  where  this  might  be  useful.  For 
example,  suppose  process  p  has  been  performing  some  valuable  computation  in  silence.  The 
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system  assumes  p  has  failed  and  restarts  a  new  version — ^but  when  the  old  version  speaks  up,  the 
system  decides  it  would  prefer  to  discard  the  rollback  and  re-incorporate  the  old  p. 

As  we  have  seen  in  Section  4.2.6,  allowing  both  the  initiator  and  the  others  to  choose  states 
from  their  genera]  pasts  permits  ambiguity.  Implementing  this  approach  in  a  distributed  fashion 
is  difficult:  processes  must  disjointly  choose  consistent  paths.  (Figure  4.34  shows  an  example.) 
Constraining  the  other  processes  to  choose  from  their  current  pasts  (but  consistently  with  the 
general  past  state  chosen  by  the  initiator)  also  creates  problems.  For  example,  the  other  processes 
may  not  be  able  to  choose  states  that  permit  the  initiator’s  choice  to  exist.  Figure  4.35  shows  one 
such  situation. 

One  interesting  avenue  for  future  work  lies  in  having  the  initiator  choose  not  a  state  but  a 
predicate  describing  a  system  state  it  would  like  restored.  (Of  course,  such  an  approach  requires 
that  the  predicate  is  satishable.) 

Another  interesting  avenue  is  to  implement  general-past  rollback  by  formally  rolling  back  roll¬ 
back.  We  might  build  a  third  level  of  partial  order  time  to  express  the  meta-recovery  computation, 
and  use  our  earlier  protocols  to  roll  back  the  recovery  computation  that  performed  the  original 
rollback. 
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Figure  4.34  During  rollback  recovery,  allowing  processes  other  than  the  initiator 
to  choose  from  their  general  pasts  creates  difficulty.  For  example,  suppose  process 
q  decides  to  roll  back  B.  The  naively  defined  rollback  pseudo-vector  of  B  is  the 
set  S.  Since  S  touches  multiple  branches  at  processes  p  and  r,  allowing  these 
processes  to  restore  states  from  their  general  pasts  creates  ambiguity:  e.g.,  if 
process  p  chooses  its  A  branch  and  process  r  chooses  its  C  branch,  then  the 
resulting  system  state  will  not  be  consistent.  For  system  consistency,  processes 
p  and  r  must  both  choose  their  primed  branches  or  both  choose  their  unprimed 
branches. 
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Figure  4.35  During  rollback  recovery,  allowing  the  initiator  to  choose  from  its 
general  past  while  constraining  other  processes  to  their  current  past  creates  diffi¬ 
culty.  For  example,  suppose  the  current  virtual  computation  has  frontier  Aj,B2, 
but  process  p  wishes  to  restore  the  state  A3.  No  state  at  process  q  in  the 
USER -PARTIAL -ORDER  past  of  's  both  Consistent  and  concurrent  with  process 
p’s  new  state. 


112 


Chapter  5 

Security  and  Privacy  for  Distributed 
Time 


5.1.  Overview 


Systems  of  time  more  general  than  the  linear  order  of  real  time  are  central  to  solving  application 
problems  in  asynchronous  distributed  systems.  Since  protocols  for  these  applications  require 
examining  the  underlying  distributed  time  models,  explicitly  providing  a  distributed  time  service 
simplifies  and  clarifies  the  task  of  protocol  design. 

However,  while  real  time  can  be  determined  from  an  independent  physical  device,  relations 
such  as  partial  order  time  cannot  be  determined  in  isolation.  Tracking  relations  such  as  the 
PARTIAL. ORDER.TEME  model  requires  collecting  and  sharing  information;  tracking  relations  in 
time  models  that  dispense  with  transitivity  or  permit  cycles  involves  even  more  local  information. 
Thus  dealing  with  distributed  time  exposes  protocols  to  security  risks.  Is  the  information  a  process 
receives  correct?  Can  shared  information  be  used  for  dishonest  purposes? 

Encapsulating  a  system’s  dealings  with  partial  order  time  into  a  single  time  service  provides  an 
arena  to  examine  and  resolve  security  and  temporal  issues  for  protocol  design. 

The  proposal  document  [Sm9 1  ]  for  this  thesis  recognized  the  central  role  of  partial  order  clocks, 
cataloged  some  of  the  security  and  privacy  risks,  and  gave  the  original  presentation  of  the  Signed 
Vector  Timestamp  protocol,  which  protects  against  some  of  these  risks.  While  this  protocol  prevents 
dishonest  processes  from  forging  causal  dependence  on  nodes  at  honest  processes,  it  suffers  from 
some  drawbacks: 

•  The  Signed  Vector  Timestamp  protocol  cannot  guarantee  detection  of  causal  paths  touching 
dishonest  processes.  Consequently,  Signed  Vectors  cannot  be  used  to  build  secure  protocols 
for  problems  such  as  distributed  snapshots  requiring  accurate  detection  of  non-precedence. 

•  The  Signed  Vector  Timestamp  protocol  leaks  private  information,  since  vector  entries  are 
publicly  readable. 

•  The  Signed  Vector  Timestamp  protocol  requires  each  process  to  check  n  signatures. 


•  The  Signed  Vector  Timestamp  protocol  requires  that  the  temporal  relation  being  tracked 
express  all  paths  of  information  flow;  thus  the  protocol  does  not  extend  to  more  general 
relations  (such  as  user. partial -ORDER  from  Chiq)ter  4). 


This  Chapter  In  this  chapter,  we  use  new  developments  in  inexpensive  tamper-proof  hard¬ 
ware  to  build  the  Sealed  Vector  Timestamp  protocol,  which  provides  stronger  security  and  privacy 
protection  than  any  previous  protocol.  Sealed  Vectors  solve  previously  open  problems  by  pre¬ 
venting  dishonest  processes  from  forging  dependence  on  any  events,  and  by  preventing  dishonest 
processes  from  denying  dependence  (if  malicious  processes  cannot  communicate  covertly).  (Even 
with  covert  communication.  Sealed  Vectors  provide  some  protection  against  denying  dependence.) 
Sealed  Vectors  also  move  beyond  previous  work  by  addressing  privacy  risks,  and  by  providing 
secure  clocks  for  partial  orders  where  information  flow  does  not  imply  precedence. 

The  proposal  document  opened  up  this  area  of  research.  This  chapter  presents  the  most  se¬ 
cure  protocols  to  date,  and  solves  problems  other  researchers  left  open  [ReGo93].  Section  5.2 
discusses  the  inherent  security  and  privacy  risks  for  partial  order  time.  Section  5.3  surveys 
the  defenses  and  presents  our  new  protocol.  Section  5.4  discusses  our  new  protocol,  and 
Section  5.5  considers  some  directions  for  future  research.  For  clarity  of  presentation,  most  of 
this  chapter  considers  the  problems  of  tracking  temporal  relations  in  a  Type  4  parallel  pair  such 
as  (PARTIAL-ORDER-TIME,  TIMELINES).  Chapter  6  will  consider  the  implications  of  this  work  for 
more  general  time  models. 

(Preliminary  versions  of  some  this  material  appeared  in  earlier  publications  [SmTy9 1 ,  SmTy94].) 


5.2.  Security  and  Privacy  Attacks 

Partial  order  time  draws  on  data  distributed  throughout  the  system.  Consequently,  building  partial 
order  clocks  requires  that  processes  share  private  information,  and  trust  the  private  information 
shared  with  them.  This  opens  opportunities  for  Byzantine  (malicious)  processes  to  manipulate 
the  clock  protocols,  and  consequently  to  manipulate  application  protocols  built  on  these  clock 
protocols. 

We  sketch  four  such  attacks  on  vector  clocks. 


Nonsense  Attacks  Malicious  processes  can  send  arbitrary  vector  entries.  Since  honest 
processes  will  dutifully  copy  and  pass  on  these  values,  a  single  act  by  a  single  malicious  process 
can  destroy  the  validity  of  many  vectors  throughout  the  system.  (Lamport  total  order  clocks  [La78] 
are  particularly  vulnerable  to  these  attacks.)  Simple  sanity  checks  fail  to  combat  this  problem. 
Suppose  vector  entries  are  integers.  If  honest  processes  refuse  to  accept  vector  entries  that  have 
increased  more  than  N,  a  dishonest  process  can  repeatedly  increase  an  entry  by  V  -  1 .  The  next 
honest  process  the  victim  talks  to  may  then  mistakenly  identify  the  honest  victim  as  corrupt. 
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Malicious  Backdating  Malicious  processes  can  selectively  reduce  vector  entries,  and  thus 
fool  honest  processes  into  thinking  events  happened  earlier  than  they  actually  did.  Consider  the 
implication  of  trading  commodities  options  on  a  public  network.  Figure  5.1  shows  how  Malicious 
Backdating  permits  the  crime  of  options frontrunning,  which  can  occur  when  brokers  may  trade  both 
for  themselves  and  for  their  clients.  (One  place  where  options  frontrunning  occurs  is  the  Chicago 
commodities  exchange.)  If  a  broker  happens  to  buy  a  small  quantity  of  shares  for  himself  before 
his  client  requests  a  large  number  of  shares,  then  the  broker  will  make  a  tidy  sum.  Consequently, 
on  receiving  a  client  request,  a  dishonest  broker  has  incentive  to  issue  a  request  of  his  own  that 
appears  not  to  have  followed  the  client  request.  In  an  electronic  exchange  using  vector  clocks,  a 
malicious  broker  can  do  this  by  re-using  an  old  vector  on  his  purchase  request.' 


Malicious  Postdating  Malicious  processes  can  selectively  inflate  vector  entries,  and  thus  fool 
honest  processes  into  thinking  events  happened  later  than  they  actually  did.  Figure  5.2  shows 
how  such  Malicious  Postdating  permits  insider  trading.  A  malicious  process  can  send  a  cohort 
an  advance  copy  of  an  announcement  along  with  an  advanced  vector.  The  cohort  can  act  on  this 


Rgure  5.1  Malicious  processes  can  selectively  backdate  nodes.  Here,  Bob  com¬ 
mits  the  crime  of  options  frontrunning  by  making  his  own  purchase  appear  not  to 
follow  his  client’s  request. 


'In  the  physical  Chicago  exchange,  the  only  defense  the  FBI  has  against  options  frontninning  is  placing  undercover 
agents  in  the  pit  to  look  for  unusually  lucky  brokers. 
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rfata,  but  use  the  advanced  vector  to  hide  her  headstart.  (The  cohort  could  even  be  unwitting;  the 
malicious  process  might  frame  her  now,  in  order  to  spread  the  blame  should  the  ruse  be  discovered 
later.) 


Compromised  Privacy  Malicious  processes  can  correctly  perform  the  vector  clock  protocol, 
but  use  the  vector  entries  to  gain  illicit  knowledge.  Figure  5.3  shows  how  this  technique  reveals 
anonymous  whistleblowers.  Changes  in  subsequent  timestamp  vectors  sent  from  Alice  to  Bob  show 
the  identities  of  processes  communicating  with  Alice. 


5.3.  Defenses 


An  ideal  clock  should  report  “A  — *■  B"  exactly  when  A  precedes  B,  even  if  processes  perform 
malicious  actions.  An  ideal  clock  should  also  confine  private  information.  We  can  evaluate  clock 
protocols  by  this  standard:  against  decreasing  amounts  of  honesty,  how  well  do  clocks  perform? 

Many  application  protocols  use  forms  of  partial  order  time  and  vector  clocks.  A  clock  protocol 
meeting  this  ideal  transparently  protects  higher-level  applications  against  the  security  and  privacy 
risks  of  Section  5.2. 


Figure  5.2  Malicious  processes  can  selectively  postdate  nodes.  Here,  Bob  leaks 
an  advance  copy  of  his  public  announcement  to  Cathy  in  such  a  way  that  allows 
her  to  act  on  the  data  first,  without  appearing  to  have  had  a  headstart. 
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bOMOf 

Bad  Bob 


boss  of 


Honest  Cathy 


Figure  5.3  Malicious  processes  can  exploit  vector  data  for  illicit  purposes.  Here, 
Bob  uses  the  timestamp  vectors  from  Alice  to  learn  the  identity  of  whistleblower 
Cathy. 


5.3.1 .  Previous  Work 


If  all  processes  are  honest,  then  the  process  p  entries  in  ail  vector  timestamps  originate  at  process  p. 
Our  Signed  Vector  Timestamp  protocol  [SmTy91,  ReGo93]  builds  on  this  observation  by  requiring 
each  process  to  digitally  sign^  its  entries  in  outgoing  timestamp  vectors.  That  is,  the  process  p 
entry  in  a  timestamp  vector  now  consists  of  the  name  of  a  node  at  process  p,  and  a  signature  from  p 
on  that  name.  This  scheme  prevents  malicious  processes  from  advancing  vector  entries  belonging 
to  honest  processes.  If  an  event  A  occurs  at  an  honest  process  and  our  time  model  expresses  all 
information  flow  paths,  then  possession  of  a  signed  entry  for  A  is  proof  of  dependence  on  A.  With 
Signed  Vectors,  A  — *■  B  when  an  honest  clock  reports  "A  — *  B"  (and  A  occurs  at  an  honest 
process).  If  all  processes  along  a  precedence  path  from  A  to  are  honest,  the  converse  is  also  true: 
an  honest  clock  reports  “A  — ►  when  A  — *  B. 

However,  Signed  Vectors  may  fail  if  precedence  paths  go  through  malicious  processes.  For 
example,  a  malicious  process  can  use  old  values  in  the  vector  entries  for  honest  processes,  as 
long  as  the  malicious  process  has  retained  the  matching  signatures.  Signed  Vectors  still  permit 
the  Malicious  Backdating  and  Malicious  Postdating  attacks.  Signed  Vectors  do  not  even  attempt 
to  address  the  Compromised  Privacy  attack.  These  problems  migrate  to  higher-level  applications. 
Inability  to  detect  non-precedence  reliably  can  result  in  inefficiency  (in  optimistic  rollback  recovery, 
processes  may  mistakenly  conclude  they  depend  on  failed  states)  or  complete  incorrectness  (in 
global  state  protocols,  processes  may  make  incorrect  decisions  regarding  “concurrent”  events). 


^Section  S.3.3  will  discuss  digital  signatures  in  more  detail. 


The  security  of  the  Signed  Vector  protocol  depends  on  the  fact  that  precedence  paths  and 
information  flow  paths  coincide.  If  precedence  and  information  flow  do  not  coincide,  then  Signed 
Vectors  do  not  provide  secure  clocks.  For  example,  consider  the  partial  order  describing  the  virtual 
computation  arising  after  rollback  with  modified  replay. 

Three  additional  protocols  exist  for  the  special  case  of  a  process  sorting  the  send  events  of  two 
messages  it  has  received  [ReGo93].  The  Piggybacking  protocol  generalizes  the  vector  timestamp 
protocol  by  timestamping  each  event  E  with  a  signed  record  of  all  messages  whose  send  events 
precede  E.  Piggybacking  (like  Signed  Vectors)  ensures  that  if  a  clock  reports  “A  — ►  B"  and 
A  occurs  at  an  honest  process,  then  A  — ►  B\  Piggybacking  further  limits  the  possible  actions 
of  a  dishonest  A  process  conspiring  to  make  a  clock  falsely  report  “A  — *  B."  However,  the 
Piggybacking  protocol  (also  like  Signed  Vectors)  cannot  reliably  detect  precedence  paths  touching 
malicious  processes,  and  does  not  address  the  issue  of  privacy.  The  other  two  protocols  from 
[ReGo93]  alter  the  order  in  which  messages  are  received.  These  protocols  address  the  problem  of 
detecting  the  partial  order  by  changing  the  partial  order;  further,  they  do  not  accurately  report  non¬ 
precedence.  The  Conservative  protocol  requires  that  before  sending  a  new  message,  a  process  wait 
for  acknowledgements  of  any  previous  messages  it  sent.  The  Causality  Server  protocol  assumes 
secure  FIFO  channels,  and  relies  on  a  trusted  central  intermediary  to  impose  a  total  order  on  all 
message  traffic. 


5.3.2.  The  Sealed  Vector  Timestamp  Protocol 

The  Sealed  Vector  Timestamp  protocol  has  security  properties  that  solve  previously  open  problems: 

•  Our  protocol  accurately  reports  “A  — >  B"  or  “A  -/->  B,”  in  the  presence  of  arbitrary  mali¬ 
cious  processes  (including  the  A  process). 

•  Our  protocol  does  not  leak  private  information. 


The  Sealed  Vector  Timestamp  protocol  satisfies  the  ideal  (assuming  no  covert  channels),  and 
protects  privacy  of  vector  entries  as  well.  Further,  this  protocol  extends  to  time  models  where 
information  flow  does  not  imply  precedence.  Table  III  compares  our  new  protocol  to  previous 
work. 


Overview  Our  new  protocol  rests  on  the  the  technology  of  secure  coprocessors  [ly  Ye93,  Yee94] : 
inexpensive  physically  secure  devices  with  a  CPU,  ROM,  and  non-volatile  RAM.  A  host  processor 
interacts  with  its  secure  coprocessor  through  formal  I/O  channels.  Any  other  method  of  determining 
the  internal  state  of  the  coprocessor — including  physically  penetrating  the  hardware — results  in  the 
erasing  of  RAM  and  CPU  registers.  Secure  coprocessors  are  being  deployed  rapidly  ;  commercial 
secure  coprocessor  products  are  available  from  IBM  (/iABYSS  [Wein87],  Citadel  [WWAP91]), 
and  have  been  announced  by  other  vendors  including  National  Semiconductor  [Va94],  Semaphore, 
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path  from  A  to 

B  is  honest 

A  is  honest 

no  one  (but  you) 
is  honest 

“A  — » B” 

=»  A-^B 

Signed,  PB, 
Sealed 

Signed,  PB, 
Sealed 

Sealed 

“A  — ►  B" 
<^A^B 

Signed.  PB, 
Sealed 

Sealed 

Sealed 

“A  — ^  B" 

^  A — >B 

Signed,  PB. 
Sealed 

Sealed 

Sealed 

privacy 
of  data 

Sealed 

Sealed 

Sealed 

Table  III  This  table  compares  how,  against  decreasing  amounts  of  honesty,  par¬ 
tial  order  clock  protocols  meet  the  clock  ideal:  reporting  “A  — ►  B"  <=>■  A  — >  B 
while  protecting  the  privacy  of  vector  entries.  Signed  denotes  the  Signed  Vector 
Timestamp  protocol;  Sealed  denotes  the  Sealed  Vector  Timestamp  protocol;  PB 
denotes  the  Piggybacking  protocol. 


Telequip,  and  Wave  Systems.  Various  protection  technologies  exist.  For  example,  IBM  wraps 
circuit  boards  in  nichrome  wire  and  then  seals  them  with  an  epoxy  mixture  chemically  stronger  than 
the  wire.  A  detection  circuit  monitors  the  resistance  of  this  wire  wrapping;  penetration  attempts 
will  disrupt  the  wire  wrapping  and  alter  the  resistance  (e.g.,  by  shorting  the  wire  or  by  cutting  it). 

Secure  coprocessors  only  possess  limited  amounts  of  power.  We  cannot  secure  an  entire 
workstation — even  if  we  could,  we  could  not  secure  the  user.  Bootstrapping  from  this  small 
amount  of  physical  security  into  full  protocol  security  raises  subtle  issues.  For  example,  malicious 
processes  might  attempt  to  bypass  coprocessors,  or  to  attack  communication  lines.  (Recent  work 
[ly  Ye93,  Yee94]  shows  how  to  protect  against  these  attacks.) 

In  the  Sealed  Vector  Timestamp  protocol,  each  process  runs  on  a  host  processor  with  a  secure 
coprocessor.  The  secure  coprocessor  creates  timestamp  vectors  and  seals  them  so  that  processes 
cannot  read  them.  Although  processes  can  store  and  exchange  timestamps,  they  need  to  query  a 
secure  coprocessor  in  order  to  compare  them. 

The  security  of  Sealed  Vectors  follows  from  a  number  of  properties: 


•  No  party  (except  a  secure  coprocessor)  can  obtain  information  about  the  contents  of  any 
vecto*-  entry  from  a  sealed  timestamp,  even  if  the  party  knows  the  other  entries. 

•  All  processes  must  route  incoming  and  outgoing  messages  through  secure  coprocessors. 

•  A  secure  coprocessor  must  be  able  to  verify  that  a  timestamp  was  properly  sealed  by  another 
secure  coprocessor. 


119 


•  Given  a  sealed  timestamp  and  an  event,  a  secure  coprocessor  must  be  able  to  verify  that  they 
match. 


5.3.3.  Cryptographic  Tools 

We  build  a  timestamp  scheme  meeting  this  description  using  two  common  cryptographic  tools: 
digital  signatures  and  bit-secure  public  key  cryptography  [DiHe76,  RSA78].  A  digital  signature  is 
a  function  S  from  a  value  space  to  a  signature  space  meeting  the  following  conditions: 

•  Given  a  value  v  and  a  signature  s,  any  party  can  determine  whether  is  a  valid  signature  of 
v:  whether  S{v)  =  a. 

•  However,  it  is  intractable  for  any  party  (except  the  privileged  signing  party)  to  take  a  set  of 
value-signature  pairs  and  produce  a  pair  not  in  this  set. 


Public  key  cryptography  consists  of  a  function  E  (from  the  plaintext  space  to  the  cipherspace) 
and  a  function  D  (from  the  cipherspace  to  the  plaintext  space)  meeting  the  following  conditions: 

•  For  any  plaintext  value  v,  any  party  can  calculate  E{v). 

•  For  any  plaintext  value  u,  £>(£’(1;))  =  v. 

•  It  is  intractable  for  any  party  (except  for  the  privileged  decrypting  party)  to  take  a  set  of 
plaintext-ciphertext  pairs  and  produce  a  pair  not  in  this  set. 

Standard  public  key  cryptography  requires  only  that  inverting  E  is  difficult  (without  the  priv¬ 
ilege  of  knowing  D).  Bit-secure  public  key  cryptography  requires  an  additional  level  of  security. 
Roughly  speaking,  from  a  given  ciphertext,  a  malicious  process  should  gain  no  information  about 
the  plaintext  that  it  did  not  know  a  priori.  ([Gold89]  presents  formal  definitions.)  Some  popular 
cryptosystems  (like  [Ra79]  and  [RSA78])  are  known  to  leak  number-theoretic  properties  of  the 
plaintexts  and  thus  fail  to  meet  this  condition  [ACGS88,  Li81].  For  the  Sealed  Vector  protocol 
to  attain  its  full  security  potential,  it  should  be  implemented  using  strong  cryptosystems  such  as 
[BlGo84]  or  [GoMi82]. 

Operation  We  use  cryptography  and  signatures  both  on  messages  {Emsg^  Z)msg  ^d  5n,sg)  and 
on  timestamps  {Eta,  £^tst  and  Sta)-^  Each  process  p  has  a  name,  which  we  denote  as  p.  Each 
process  p  runs  on  a  host  processor  with  a  secure  coprocessor,  which  we  denote  as  psc-  Each  secure 
coprocessor  knows  that  name  of  its  process. 


^This  presentation  assumes  global  schemes  for  all  processes.  In  practice,  giving  each  process  its  own  key  scheme  adds 
flexibility  and  another  level  of  security;  Section  S.4.2  discusses  these  issues. 
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Let  V  be  the  set  of  process  names,  let  S  be  the  set  of  event  names,  let  V  be  the  set  of  possible 
timestamp  vectors,  and  let  Ai  be  the  set  of  possible  message  texts.  Let  ^msg  and  ^tsi  be  the 
signatures  spaces  for  messages  and  timestamps,  respectively;  let  Cmg  and  Cm  be  the  cipherspaces 
for  messages  and  timestamps.  Our  signature  and  encryption  functions  act  according  to  these  rules: 

5m  :  C  X  V 

^m  :  C  X  V  X  ^m  Cm 
‘S'msg  :  ^  X  'P  X  A'f  X  Cm  ^msg 
•^msg  :  ^  X  P  X  .V(  X  Cm  ^  ^msg  '  ^  Cmsg 

The  functions  Ema^  and  Eta  are  public.  Each  secure  coprocessor  psc  has  the  ability  to  calculate 
J^msg.  Dot,  5msg.  and  5m;  the  coprocessor  psc  also  maintains  the  current  process  p  timestamp  vector, 
which  we  denote  as  Vp. 

Obtaining  Timastamps  Suppose  process  p  wants  to  obtain  a  timestamp  for  its  current  event 
A.  Process  p  submits  the  request  to  psc.  which  obtains  V(i4)  by  incrementing  the  p  entry  of  Vp. 
The  coprocessor  psc  then  returns  the  sealed  timestamp: 

T{A)  =  EtaiA,\{A),Stst{A,\{A))) 

Figure  5.4  illustrates  this  structure. 

The  signature  plays  two  roles  here.  First,  it  proves  that  this  vector  belongs  to  this  event. 
Secondly,  its  presence  inside  the  plaintext  protects  against  a  malicious  process  guessing  the  value 
of  the  vector,  and  verifying  this  guess  using  Eta- 


Rgure  5.4  A  sealed  timestamp  consists  of  the  encryption  of  three  items:  the 
name  of  an  event,  its  timestamp  vector,  and  a  signature  on  this  pair.  The  signature 
certifies  that  this  vector  belongs  to  this  event,  and  also  protects  against  guessing 
the  plaintext:  verifying  a  guessed  vector  requires  guessing  the  correct  signature. 
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Comparing  Timestamps  When  process  p  wants  to  compare  events  A  and  B,  it  sends  TiA) 
and  T(B)  to  psc-  The  coprocessor  a4>plies  Dut  to  extract  the  event  names,  vectors  and  signatures. 
If  the  signatures  are  valid,  the  coprocessor  then  compares  V(A)  and  V(B),  and  reports  the  result: 
either  ‘M  — ^  “B  A"  or  “/I  ^  5.” 


Sending  Messages  Suppose  process  p  wants  to  execute  a  send  event  S,  sending  a  message 
with  text  M  to  process  q.  Process  p  submits  M  and  q  to  the  secure  coprocessor  psc.  which 
calculates  the  timestamp'*  T{S),  and  returns  the  ciphertext 

M'  =  E^^ip,q,M,TiS),S^^ip,q,M,T{S))) 

Figure  5.5  illustrates  this  structure.  Process  p  then  transmits  the  message. 

A  malicious  process  might  still  be  able  to  suppress  this  message  M.  (For  example,  inFigure5.i, 
Bad  Bob  could  have  his  purchase  order  sealed,  but  only  introduce  it  into  the  network  if  he  receives 
an  order  from  his  client.)  The  secure  coprocessor  psc  can  protect  against  loss  by  requiring  a 
signed  acknowledgement  from  ^sc-  If  acknowledgement  does  not  arrive,  psc  can  retransmit 
the  message — ^perhaps  incrementally,  as  part  of  other  sealed  packets.  A  malicious  process  can 
successfully  suppress  a  message  only  by  permanently  partitioning  itself  from  the  network. 


Receiving  Meeeagee  Suppose  a  process  p  receives  a  ciphertext  message  M'.  To  read  M', 
process  p  needs  to  send  it  to  the  secure  coprocessor  psc-  The  coprocessor  applies  D^sg  to  obtain 
the  source  and  destination  process,  the  plaintext  M,  the  timestamp  T'S)  of  the  send  event,  and 
the  5msg  signature  of  this  data.  The  coprocessor  verifies  that  the  5msg  signature  is  valid  and  that 
p  is  the  intended  destination  process.  The  coprocessor  then  applies  Ptsi  to  the  timestamp,  checks 
its  signature,  and  obtains  the  vector  V(5).  The  coprocessor  then  performs  the  vector  timestamp 


Figure  5.5  The  message  ciphertext  encrypts  the  message  information  (source 
and  destination  processes,  message  text),  along  with  the  sealed  timestamp  of  the 
send  and  a  signature  of  these  values. 


^Since  messages  are  tagged  with  a  signature  before  encrypting,  using  the  unsealed  timestamp  V(5)  would  suffice  here. 
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protocol:  replacing  its  current  vector  Vp  with  the  entry-wise  maximum  of  Vp  and  V(5).  Finally, 
psc  returns  to  p  the  name  of  the  source  process,  the  plaintext  M,  and  (optionally)  the  timestamp 

ns). 


5.4.  Discussion 

5.4.1.  Results 

We  make  some  preliminary  observations: 

•  The  coprocessors  carry  out  the  vector  timestamp  protocol.  This  follows  directly 
from  the  description. 

•  Only  secure  coprocessors  can  unseal  messages  and  timestamps.  A  process 
may  be  able  to  guess  some  or  all  of  the  entries  of  a  given  timestamp  vector.  If  timestamps 
were  merely  vectors  encrypted  with  a  public  key,  then  a  process  could  guess  a  possible 
vector,  encrypt  the  guess,  and  compare  the  result  to  the  ciphertext.  However,  in  our  scheme, 
timestamps  are  the  encryption  of  a  vector  along  with  a  signature  of  that  vector.  Without 
knowing  the  signature  function,  a  process  cannot  verify  that  V  is  the  vector  in  the  timestamp 
EtstiA,  V,  Stst(A,  V)).  Timestamps  are  truly  sealed. 

Similarly,  with  high  probability  a  process  cannot  decrypt  an  encrypted  message  by  making 
some  lucky  guesses,  since  that  would  require  breaking  the  message  signature  5msg- 

•  Only  the  secure  coprocessor  at  the  source  process  may  seal  messages. 

Messages  arriving  at  an  honest  process  will  be  routed  to  the  secure  coprocessor,  which  will 
ignore  messages  that  do  not  include  both  a  valid  timestamp  and  a  valid  signature  on  the 
message  and  the  timestamp  together. 

•  Only  the  secure  coprocessor  at  the  Intended  destination  process  may  unseal 

a  message.  Sealed  messages  must  be  decrypted  to  be  intelligible.  The  receiving  process 
must  consult  its  secure  coprocessor,  since  the  encrypted  message  includes  the  name  of  the 
intended  destination  process.  (However,  a  malicious  process  can  receive  and  discard  an 
encrypted  message  without  consulting  its  coprocessor.  Section  5.4.2  considers  this  avenue.) 

Together,  these  assertions  imply  the  following  result: 

Theorem  5.1  Suppose  the  following  are  true  statements: 

•  All  messages  to  or  from  honest  processes  are  routed  through  through  secure 
coprocessors. 

•  The  encryption  and  signature  functions  are  not  breakable. 
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•  The  integrity  of  the  secure  coprocessors  is  not  compromised. 

Then  Sealed  Vectors  guarantee  the  following  properties: 

•  If  a  clock  reports  “A  — >  B"  then  A  — ►  B. 

•  If  node  A  precedes  node  B  along  a  path  where  each  message  edge  touches  an 
honest  process,  then  clocks  will  report  “A  — >■  B.” 

Proof  Let  ^  be  the  PARTIAL  _  ORDER  _  TDtE  graph  of  this  computation.  To  construct  a  graph  7  that 
reflects  the  computation  perceived  by  the  secure  coprocessors,  we  perform  these  steps: 

1 .  Copy  the  entire  timeline  belonging  to  each  honest  process. 

2.  For  each  message  edge  incident  to  an  honest  process,  copy  the  edge,  and  the  node  at  the  other 
end  (if  it  is  not  already  in  7). 

3.  Add  each  node  that  a  dishonest  process  registers  with  its  coprocessor. 

4.  At  each  dishonest  process,  connect  the  7  nodes  in  their  ^  sequence. 

A  coprocessor  reports  “A  — ►  B"  in  ^  iff  A  — >  5  in  7.  □ 


Corollary  5.2  Suppose  that,  in  addition  to  the  hypothesis  of  Theorem  5.1,  mali¬ 
cious  processes  cannot  communicate  without  using  the  sealed  message  protocol.  Then 
Sealed  Vectors  guarantee  that  clocks  report  “A  — ►  B"  iff  A  — >■  B. 


Proof  Construct  7  as  in  the  proof  of  Theorem  5. 1 ,  only  add  all  message  edges  and  their  incident 
nodes  (if  they  are  not  already  in  7).  □ 

This  protocol  improves  on  prior  work  by  offering  security  advantages: 

•  Complete  Results  If  a  clock  reports  “A  — ^  5,”  then  A  — >  B.  If  a  clock  reports 
“A  B"  (and  malicious  processes  cannot  communicate  using  covert  channels)  then 
A^B. 

•  No  Spoofing  Even  with  covert  channels,  a  malicious  process  cannot  deny  having  received 
a  message  from  an  honest  process. 

•  Privacy  The  private  information  shared  in  timestamps  is  confined  to  the  secure  coproces¬ 
sors. 

•  Wider  Application  The  Sealed  Vector  Timestamp  protocol  does  not  require  that  the 
partial  order  directly  arise  from  information  flow. 
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In  particular.  Sealed  Vectors  protect  against  all  the  attacks  catalogued  in  Section  S.2,  and  provide 
secure  clocks  for  scenarios  such  as  the  partial  order  arising  after  rollback  with  modified  replay. 

Sealed  Vectors  also  improve  on  Signed  Vectors  in  terms  of  scalability:  the  number  of  decryp¬ 
tions  required  on  incoming  messages  decreases  from  linear  to  constant. 


5.4.2.  Implicit  Assumptions 

This  chapter  has  made  several  implicit  assumptions  open  to  challenge.  We  discuss  these  challenges. 

No  Covert  Channels  Precedence  corresponds  to  paths  through  the  partial  .order  .time 
gr£q)h.  The  Sealed  Vector  protocol  prevents  a  single  malicious  process  from  masking  its  presence 
in  such  paths.  However,  if  malicious  processes  can  communicate  without  using  official  (that  is, 
coprocessor-sealed)  messages,  then  they  can  cooperatively  hide  their  presence  in  paths — since 
communication  outside  of  the  coprocessors  is  invisible  to  the  clocks. 

One  approach  to  this  problem  is  to  make  such  communication  very  difficult:  for  example, 
by  having  the  secure  coprocessors  handle  net  traffic  (and  perhaps  snoop  on  Ethernet  packets), 
malicious  processes  would  be  forced  to  communicate  outside  the  network. 

Covert  communication  is  also  possible  using  in-band  signaling,  since  it  may  be  possible  to 
extract  information  from  sealed  messages  without  consulting  secure  coprocessors.  For  example, 
a  malicious  process  might  draw  conclusions  from  the  existence  of  the  message,  the  length  of  the 
message  (real  encryption  usually  breaks  long  text  into  blocks  and  encrypts  each  block  separately) 
or  the  frequency  of  multiple  messages. 


Security  of  Coprocessors  The  protocol  depends  on  the  physical  security  of  the  coprocessors. 
In  practice,  secure  coprocessors  are  extremely  difficult  to  penetrate.  However,  as  with  any  security 
mechanism  (physical  or  computational),  it  may  be  possible  to  compromise  the  system  if  the  attacker 
is  willing  to  pay  tremendous  amounts  of  money.  (For  a  detailed  analysis  of  the  cost,  see  [Wein91  ].) 
What  do  we  do  if  the  exception  case  occurs — if  a  coprocessor  is  compromised?  One  way  to  limit 
the  damage  is  to  use  separate  5msg,  Stst  and  FJmsg  functions  for  each  process.  This  technique  prevents 
a  compromised  coprocessor  from  impersonating  someone  else  or  performing  message  decryption 
for  someone  else.  Using  separate  Etsi  functions  prevents  the  compromised  coprocessor  from  doing 
comparisons  for  someone  else,  but  requires  re-encrypting  forwarded  timestamps.  (Section  5.5 
considers  some  further  defenses.) 


Validity  of  Keys  Giving  each  coprocessor  its  own  keys  raises  the  issue  of  key  management:  a 
new  coprocessor  must  somehow  announce  its  public  keys.  A  straightforward  technique  to  prevent 
dishonest  processes  from  impersonating  a  “new  coprocessor”  is  to  have  new  coprocessors  obtain 
certificates,  signed  by  a  universally  trusted  agent,  listing  their  identity  and  public  keys. 
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5.5.  Future  Work 


Limiting  Penetration  Damage  What  can  we  do  if  the  integrity  of  a  coprocessor  is  compro¬ 
mised?  Penetration  exposes  any  data  that  a  coprocessor  has  saved.  However,  an  uncompromised 
coprocessor  can  securely  forget  data.  This  observation  suggests  an  alternative  Give-and-Forget 
timestamping  scheme.  Suppose  process  p  at  event  S  sends  a  message  to  process  q,  who  receives 
it  at  event  R.  Process  p  generates  a  key  pair  A"i,s.  A’a.s-  Process  p  signs  a  certificate  asserting 
that  A'2,5  is  its  public  key  for  event  S,  and  sends  this  certificate  along  with  the  private  key  K\,s 
to  process  q  with  the  message.  Process  q  uses  the  private  key  A'i,s  to  encrypt  an  identifier  for  R 
and  then  erases  the  key.  Process  q  then  has  a  universally  verifiable  certificate  that  it  knew  about 
S  when  R  occurred.  However,  examining  this  certificate  allows  no  one — not  even  process  q — to 
forge  a  new  certificate  of  knowledge  of  S  without  the  cooperation  of  process  p. 

This  technique  allows  a  secure  coprocessor  to  generate  proof-of-timestamp  certificates  showing 
the  last  message  received  from  each  uncompromised  process.  Should  the  coprocessor  later  be 
compromised,  it  cannot  produce  new  certificates  for  these  messages.  To  prevent  a  compromised 
coprocessor  from  rolling  back  timestamp  entries,  we  can  require  all  coprocessors  to  use  these 
proof-of-timestamp  certificates  to  prove  the  validity  of  each  entry  in  their  timestamp  vectors. 

Other  approaches  for  pre-compromised  coprocessors  to  limit  the  forging  power  of  their  com¬ 
promised  versions  include  the  Distributed  Trust  and  Digital  Timestamping  techniques  of  [BHS92, 
HaSt91],  as  well  using  data  on  acknowledgement  packets. 


Improving  Performance  A  performance  problem  with  vector  clocks  results  from  size:  time- 
stamps  have  n  entries;  comparing  timestamps  requires  n  comparisons.  Charron-Bost’s  result 
[CB91]  that  partial  order  timestamps  must  be  linear  suggests  two  approaches  to  improving  per¬ 
formance:  implementing  vector  clocks  more  carefully  (to  reduce  the  actual  data  transmitted),  and 
trading  timestamp  size  for  comparison  time. 

Singhal  and  Kshemkalyani  [SiKs90]  present  a  vector  clock  implementation  where  processes 
refrain  from  transmitting  redundant  data  in  vectors.  Integrating  this  technique  with  Sealed  Vectors 
would  yield  increased  efficiency. 

A  more  generalized  approach  would  be  to  give  processes  more  latitude  in  choosing  which 
entries  to  transmit  and  which  to  withhold.  Some  entries  in  timestamp  vectors  might  be  marked 
with  flags  indicating  that  that  value  is  merely  a  lower  bound.  This  lower  bound  may  suffice  for  many 
comparisons;  if  it  does  not,  a  secure  coprocessor  would  need  to  consult  other  secure  coprocessors 
to  obtain  the  missing  data.  It  would  be  interesting  to  develop  good  heuristics  for  deciding  which 
entries  to  withhold  and  for  determining  when  the  expense  of  a  “miss”  outweighs  the  benefits  of 
withholding. 

Another  interesting  approach  would  be  to  implement  vector  clock  protocol  in  a  more  centralized 
fashion.  For  the  extreme  case,  suppose  we  had  a  single  trusted  logging  site.  When  a  process  receives 


a  message,  its  secure  coprocessor  sends  a  note  to  the  logging  site  indicating  the  sending  process, 
the  receiving  process,  and  the  local  indices  of  the  send  and  receive  events.  The  logging  site  then 
has  sufficient  information  to  maintain  the  timestamp  vectors  for  each  process.  We  obtain  constant 
size  timestamp  data  on  messages — at  the  price  of  doubling  the  number  of  messages,  and  having 
processes  need  to  consult  a  remote  site  to  perform  comparisons.  This  approach  still  requires 
coprocessor  sealing  in  order  to  force  a  process  not  only  to  acknowledge  receiving  a  message,  but 
also  to  file  a  logging  note.  (This  approach  differs  from  the  Causality  Server  protocol  [ReGo93]  in 
that  messages  are  not  routed  through  an  an  intermediary,  but  logged  after  the  fact,  that  no  FIFO 
nor  secure  channel  assumptions  are  needed,  and  that  the  logging  site  protocol  preserves  the  actual 
partial  order,  not  just  a  consistent  total  order.) 

Yet  another  technique  (e.g.,  [ACGS91])  is  to  use  vector  clocks  to  track  a  coarser  partial 
order — trading  timestamp  size  for  false  positives  in  precedence  detection.  However,  adapting 
these  techniques  (or  the  linear  timestamping  techniques  of  [BHS92,  HaSt91])  creates  the  problem 
of  proving  the  absence  of  a  precedence  path.  Developing  a  hierarchical  approach — to  indicate  the 
most  “likely”  precedence  path,  and  then  verify  its  correctness — is  one  path  of  future  research. 


General  Confinement  Models  Another  area  for  exploration  is  the  use  of  more  general  con¬ 
finement  models.  Coprocessor  sealing  provides  control  over  the  information  a  timestamp  provides 
to  a  process.  This  control  may  provide  more  benefits  than  Just  suppressing  vector  entries — in 
particular,  it  may  allow  for  anonymous  or  hidden  causality  [Gr75]. 
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Chapter  6 


Secure  Distributed  Time  for  Secure 
Distributed  Protocols 


6.1.  Overview 


Chapters  2  through  4  showed  how  framing  application  problems  in  terms  of  distributed  time  pro¬ 
vides  a  deeper  understanding  of  the  problems,  and  allows  the  development  of  flexible  and  general 
protocols  that  access  the  distributed  time  structure  by  querying  clock  primitives.  Separating  the 
clocks  from  the  higher-level  protocols  in  this  fashion  allows  us  to  change  the  clock  implementations 
transparently  to  the  higher-level  protocols.  However,  the  popular  timestamp  vector  implementation 
of  partial  order  clocks  suffers  from  security  and  privacy  risks,  as  Chapter  5  discussed. 

These  security  and  privacy  risks  for  timestamp  vectors  create  problems  for  higher-level  proto¬ 
cols  that  use  these  clock  implementations.  For  example,  malicious  clients  can  exploit  the  security 
and  privacy  risks  of  timestamp  vectors  in  order  to  subvert  the  immediate  ordered  service  protocol 
of  Section  2.5.2.  Standard  attacks  on  timestamp  vectors  translate  to  higher-level  protocol  attacks: 

•  Backdating  A  malicious  process  p  could  ensure  that  its  requests  receive  undue  priority 
by  backdating  the  timestamp  vectors  on  them. 

•  Postdating  Alternatively,  a  malicious  process  p  could  ensure  that  its  requests  always 
precede  those  from  an  honest  process  q  by  sending  postdated  vectors  on  its  messages  to  q. 

•  Privacy  A  malicious  process  could  use  the  timestamp  vectors  sent  as  part  of  the  protocol 
to  spy  on  the  activity  of  other  processes. 

Chapter  5  considered  two  approaches  to  provide  secure  clocks  for  the  partial  .order  .time 
model:  the  Signed  Vector  Timestamp  protocol  and  the  Sealed  Vector  Timestamp  protocol.  Using 
the  security  of  these  clocks  to  provide  security  for  higher-level  application  protocols  (such  as  those 
presented  in  Chapters  2  through  4)  raises  two  critical  issues: 

•  Do  the  security  properties  of  the  clocks  protect  the  application  protocols  against  clock-based 
attacks? 


•  Do  the  security  properties  of  the  clocks  hold  for  the  higher-level  time  models  considered  by 
some  application  protocols? 

For  example,  the  Signed  Vector  Timestamp  protocol  protects  immediate  ordered  service  only 
against  some  of  the  postdating  risks — with  Signed  Vectors,  a  malicious  process  must  confine  its 
postdating  to  entries  belonging  to  processes  whose  keys  it  knows.  The  Signed  Vector  Timestamp 
protocol  provides  even  less  protection  if  (due  to  failure  and  recovery)  the  partial  order  model  is 
flow-virtual.  On  the  other  hand,  the  Sealed  Vector  Timestamp  protocol  eliminates  all  three  risks. 

Chapter  6  examines  these  issues  of  security  and  privacy  for  higher-level  protocols  and  time 
models.  Section  6.2  explores  the  protection  that  our  secure  vector  protocols  provide  for  the  time 
models  considered  in  this  thesis.  Section  6.3  and  Section  6.4  consider  the  security  implications  for 
the  application  problems  of  distributed  snapshots  and  optimistic  rollback  recovery,  respectively. 


6.2.  Security,  Timestamps,  and  Time  Modeis 

Section  6.2.1  discusses  the  general  paradigm  behind  the  clock  schemes  proposed  in  this  thesis. 
Section  6.2.2  discusses  some  attacks  permitted  by  this  family.  Section  6.2.3  discusses  how  the 
defenses  proposed  in  Chapter  5  fare  against  these  attacks,  for  various  types  of  time  models. 


6.2.1.  Timestamp  Clocks 


The  clock  protocols  discussed  in  this  thesis  are  based  on  timestamps:  processes  generate  a  time- 
stamp  associated  with  an  event  or  state  A,  and  this  timestamp  serves  to  sort  A  relative  to  other 
events  or  states. 

Such  timestamp  clocks  are  easily  implemented  for  Type  4  parallel  pairs — pairs  that  are  con¬ 
sistent,  independent,  strongly  monotonic  and  flow-supported.  The  ease  of  implementation  follows 
from  these  properties: 

•  Strong  monotonicity  implies  the  relation  between  two  nodes  is  established  forever  once  they 
come  into  existence. 

•  Flow-support  implies  that  a  process  has  the  potential  to  know  all  information  necessary  to 
create  a  timestamp  for  a  node  when  the  node  comes  into  existence. 

For  example,  consider  generating  the  timestamp  vector  for  a  node  A  in  the  partial  .order  .time 
model.  The  timestamp  vector  V(/l)  is  well-defined  when  A  occurs,  due  to  strong  monotonicity: 
when  A  occurs,  all  the  nodes  that  precede  A  have  occurred,  and  their  precedence  is  established. 
The  timestamp  vector  V(A)  can  be  created  when  A  occurs,  due  to  flow-support:  information  paths 
exist  from  every  node  in  V(/l)  to  A. 
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Timestamp  clocks  can  also  be  implemented  for  some  Type  2  parallel  pairs — pairs  that  are  only 
guaranteed  to  be  consistent  and  independent.  These  pairs  lack  the  convenient  properties  of  Type  4, 
but  we  may  compensate: 

•  Weak  monotonicity  implies  that  when  a  precedence  relation  is  established  between  two  nodes, 
the  relation  holds  forever.  Consequently,  weak  monotonicity  coupled  with  a  way  to  determine 
when  all  such  relations  have  been  established  for  a  node  still  permits  a  timestamping  scheme. 

•  Strictly  speaking,  only  the  agents  that  create  timestamps  require  information  flow.  These 
agents  do  not  need  to  be  processes — for  example,  the  Sealed  Vector  Timestamp  protocol 
splits  clocks  from  processes. 


Before  we  can  discuss  example  implementations  for  time  models  other  than  Type  4  parallel 
pairs,  we  need  machinery  to  separate  clock  agents  (and  their  experience)  from  process  agents.  The 
tools  of  distributed  time  provide  an  easy  way  to  express  this  notion:  we  can  build  a  Type  4  parallel 
pair 

(clock  -  PARTIAL  _  ORDER,  CLOCK  .TIMELINES ) 

to  expre  he  computational  activity  and  information  flow  of  the  clock  agents.  We  consider  various 
pairs: 


•  For  the  partial. order.time  model  with  the  processes  themselves  implementing  clocks, 
the  clock  pair  above  is  the  same  as  (partial. order. TIME,  hmelines). 

•  For  the  system  .partial. ORDER  and  user,  partial.  ORDER  models,  if  processes  them¬ 
selves  implement  clocks,  then  the  clock  pair  is  the  same  as 

(system  -PARTIAL. order,  SYSTEM  .TIMELINES ) 

If  processes  use  separate  clock  processors,  then  the  clock  pair  is  the  partial  order  parallel  pair 
obtained  by  treating  clocks  as  separate  processes. 

Using  (CLOCK-PARTIAL-ORDER,CLOCK-TlMELlNES)  clarifies  the  discussion  of  when  we  can 
build  timestamp  clocks  for  a  weakly  monotonic  model  M.  Basically,  we  use  the  clock  computation 
to  simulate  strong  monotonicity  and  flow-support.  We.restate  the  earlier  conditions  in  these  terms: 


•  Simulated  Strong  Mono^^nlcity  Clocks  in  clock  .partial  .order  generate  time- 
stamps  for  nodes  A  and  in  M  only  when  the  relation  between  A  and  B  is  fixed. 

•  Simulated  Flow-Support  If  a  precedence  path  exists  from  node  A  to  node  B  in 
M{CUR.GRAPH),  then  a  precedence  path  exists  from  A  to  the  clock  node  that  generates  a 
timestamp  for  B. 
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For  example,  consider  generating  timestamp  vectors  for  the  STRONG -PARTIAL  .ORDER  model.  A 
send  event  5  depends  on  the  corresponding  receive  event  R,  only  no  information  path  exists  from  R 
toS.  As  a  result,  when  a  node  A  occurs,  the  information  necessary  to  create  V(A)  is  not  available, 
and  in  fact  V(  A)  may  not  even  be  well-defined.  However,  a  clock  coprocessor  could  keep  track  of 
the  set  X  of  receive  events  it  depends  on  but  does  not  know  about,  and  generate  for  A  an  interim 
timestamp  consisting  of  a  vector  V  and  this  set  X.  This  interim  timestamp  satisfies  the  invariant: 

V(4)  =  (  U  V(/J))  U  V 

Independently,  the  clock  coprocessors  share  interim  timestamp  information  for  receive  events  that 
have  occurred,  and  transform  interim  timestamps  to  reflect  this  new  information.  If  the  set  in  an 
interim  timestamp  for  node  A  becomes  empty  at  clock  node  Be  at  process  p,  then  all  nodes  that 
will  ever  precede  A  have  occurred,  and  information  paths  exist  to  Be  from  each  of  these  nodes. 
The  clock  at  process  p  may  then  generate  the  full  timestamp  vector  V(  A). 


Precedence  Horizone  The  timestamp  vector  protocols  use  timestamps  that  specify  precedence 
horizons:  the  timestamp  vector  for  node  A  consists  of  the  names  of  the  process-maximal  nodes  that 
precede  or  equal  A.  Such  precedence  horizons  function  as  clocks  for  parallel  pairs  where  processes 
can  sort  events  in  the  other  process’s  local  time  structures.  As  Chapter  4  described,  this  approach 
also  extends  to  restricted  subgraphs  of  nonlinear  pairs  (e.g.,  when  a  well-defined  valid  computation 
emerges  from  USER  .  partial  _  ORDER). 


6.2.2.  Attacks 

Clocks  based  on  precedence  horizons  have  three  distinct  tasks: 

•  Generating  Local  Tokens  A  clock  at  a  process  must  generate  a  local  token  for  each  of 
its  nodes.  This  token  may  an  integer  or  a  more  complex  identifier,  and  may  include  items 
such  as  signatures. 

•  Assembling  Timestamps  A  clock  at  a  process  must  assemble  sets  of  these  local  tokens 
into  a  global  timestamp. 

•  Disassembling  Timestamps  A  clock  at  a  process  must  disassemble  a  global  timestamp 
into  local  tokens,  some  of  which  may  be  re-used  when  assembling  subsequent  timestamps. 

These  tasks  generate  the  following  security  concerns; 

•  Is  a  given  local  token  correct?  Suppose  the  clock  at  process  q  has  a  token  for  node  A  at 
process  p.  Did  node  A  actually  occur?  Is  this  the  correct  token  for  A?  Should  the  clock  at  q 
even  possess  this  data? 
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•  Is  the  assembly  correct?  Clocks  are  supposed  to  follow  some  set  of  specified  rules  when 
assembling  timestamps.  Were  these  rules  followed? 

•  Is  the  information  released  by  disassembling  a  timestamp  confined  to  ^propriate  agents? 

These  concerns  create  opportunities  for  malicious  agents  to  attack  clock  protocols.  Chapter  5 
discussed  three  such  attacks.  Compromised  privacy  may  occur  when  agents  release  data  from 
disassembled  timestamps.  Violating  the  assembly  rules  (and  creating  fraudulent  tokens)  leads  to 
backdating  and  postdating  attacks;  these  violations  can  also  lead  to  concurrent-dating  attacks  in 
which  some  vector  entries  are  advanced  and  others  reduced. 

The  PARTIAL -ORDER -TIME  model  alone  provides  a  single  partial  order  with  straightline  graphs 
at  processes.  Departing  from  this  comfortable  world  permits  two  additional  attacks: 

•  Level'Mixing  When  we  deal  with  multiple  levels  of  time  without  adequately  distinguishing 
the  levels,  a  malicious  agent  may  assemble  timestamps  for  one  level  using  tokens  for  another. 

•  Branch-Mixing  In  nonlinear  pairs  such  as  user -partial -Order,  a  malicious  agent 
may  assemble  timestamps  using  at  least  one  token  from  an  incorrect  process  branch.  Such 
“sidedating”  places  an  event  in  a  computation  different  from  the  one  actually  occurring. 


6.2.3.  Defenses 

Signed  Vectors  The  Signed  Vector  Timestamp  protocol  requires  processes  to  implement  their 
own  clocks,  and  addresses  the  security  concerns  of  Section  6.2.2  by  using  cryptography  to  verify 
the  identity  of  the  process  creating  the  local  tokens.  Each  process  has  its  own  private  key;  multiple 
levels  of  processes  presumably  have  distinct  private  keys. 

This  approach  raises  two  significant  problems: 

•  The  protocol  restricts  only  identity,  not  time. 

•  The  security  of  the  protocol  rests  on  an  implicit  assumption  that  the  time  model  is  not 
flow-virtual:  that  information  flow  implies  precedence. 

We  now  consider  these  problems  in  more  detail.  The  Signed  Vector  Timestamp  protocol  leaves 
processes  completely  free  to  create  arbitrary  local  tokens.  This  flaw  permits  the  postdating  attacks: 
a  malicious  process  p  can  advance  its  own  local  counter,  sign  it,  and  pass  this  along  to  a  process 
q  as  the  “real”  value.  Processes  are  also  free  to  create  arbitrary  global  timestamps  from  the  local 
tokens  available.  This  flaw  permits  the  backdating  attacks  on  partial  .order -TIME:  a  malicious 
process  p  can  assemble  an  arbitrary  timestamp  from  the  signed  entries  it  possesses. 

The  Signed  Vector  Timestamp  protocol  also  implicitly  assumes  that,  barring  signature  compro¬ 
mise,  possession  of  a  signed  entry  for  a  node  implies  precedence  from  that  node.  Suppose  that  node 
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A  occurs  at  an  honest  process  p,  that  a  process  q  at  clock  node  Be  generates  a  timestamp  for  node  B, 
and  that  this  timestamp  includes  a  signed  entry  from  node  A.  For  Signed  Vectors,  this  is  sufficient 
evidence  to  conclude  that  A  — ^  ^  in  the  higher-level  model  M.  However,  all  we  are  justified 
in  concluding  is  that  Ac  — ►  Be  in  the  CLOCK. PARTIAL  _6Rl5Bf,  where  Ac  was  the  timestamp 
generation  event  for  A.  In  flow-virtual  time  models  (such  as  USER  .partial  .order),  precedence 
in  CLOCK -PAkriAL.ORDBI  will  not  imply  precedence  in  the  higher-level  M.  In  these  cases.  Signed 
Vectors  permit  branch-mixing  attacks.  (Figure  6.1  shows  a  simple  example.)  Branch-mixing  may 
even  masquerade  as  the  postdating  of  the  entries  belonging  to  honest  processes. 


SMted  Vectors  Using  secure  coprocessors  to  implement  clocks  allows  reliable  location  of 
tokens  in  both  space  and  time.  Using  secure  coprocessors  also  ensures  that  no  rules  are  broken  in 
the  assembly  and  disassembly  of  global  timestamps,  since  we  can  trust  the  secure  coprocessor  at 
any  process  to  track  a  local  counter  and  (barring  communication  subversion)  assemble  the  correct 
pieces  into  timestamps.  Secure  coprocessors  could  also  be  used  to  track  relations  in  a  model  M 
more  general  than  the  underlying  SYSTEM. partial. ORDER  model,  if  the  model  is  well-defined  in 
terms  of  the  SYSTEM  .partial,  order.  Thus  the  security  properties  of  Sealed  Vectors  extend  to 
models  such  as  the  user. partial. order  and  the  strong. partial. order. 


p' 


q: 


r. 


Figure  6.1  The  Signed  Vector  Timestamp  protocol  fails  for  flow-virtual  time  mod¬ 
els,  since  processes  may  retain  signed  entries  from  previous  lifetimes.  Suppose 
process  p  has  rolled  back  for  reasons  other  than  local  failure;  either  voluntarily, 
or  in  response  to  failure  at  another  process.  Process  p  at  node  A(,  can  forge 
user. partial. ORDER  dependence  on  nodes  Bi  through  B^  at  process  q  and  on 
node  Cl  at  process  r,  because  an  information  path  exists  from  A4  to  A(,.  Even 
giving  each  process  incarnation  its  own  private  key  does  not  remove  this  problem. 
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By  also  functioning  as  reliable  oracles  at  processes,  secure  coprocessors  make  new  techniques 
possible.  For  example,  the  secure  coprocessor  at  process  p  will  truthfully  list  a  complete  set  of 
nodes  at  p  satis^ing  some  particular  property  (provided  that  the  nodes  have  been  registered  with 
the  coprocessor,  and  that  the  property  is  something  that  the  coprocessor  has  sufficient  information 
to  evaluate). 


6.3.  Distributed  Snapshots 

Chapter  3  discussed  the  problem  of  taking  distributed  snapshots  in  terms  of  the  distributed  time 
framework.  This  discussion  took  two  paths:  using  clocks  for  partial  order  time  to  build  Round 
Robin  protocols  assembling  global  states,  and  using  such  snapshot  protocols  with  more  general 
time  models  in  order  to  capture  global  states  with  more  specific  properties.  Their  use  of  distributed 
time  clocks  makes  these  protocols  susceptible  to  the  security  and  privacy  risks — and  defenses — of 
Chapter  S. 

This  section  considers  these  issues.  Section  6.3.1  considers  active  attacks,  and  Section  6.3.2 
considers  passive  ones.  Section  6.3.3  discusses  the  security  and  privacy  implications  for  the 
distributed  time  snapshot  protocols  using  more  abstract  time  models. 


6.3.1.  Active  Attacks 

Distributed  snapshot  protocols  based  on  timestamp  vectors  inherit  their  security  risks.  Since  taking 
a  snapshot  requires  more  than  just  sorting  timestamps,  these  protocols  are  liable  to  some  additional 
risks  as  well.  That  is,  taking  a  distributed  snapshot  involves  two  somewhat  orthogonal  tasks: 

•  assembling  a  timeslice,  and 

•  obtaining  a  description  of  the  activity  on  this  timeslice. 

A  malicious  process  may  actively  attack  both  tasks. 


Attacking  Timeslice  Assembly  The  basic  Round  Robin  snapshot  protocol  of  Section  3.2. 1 
assembles  a  maximal  set  of  nodes  mutually  concurrent  in  the  transitive  global  time  model.  This 
basic  protocol  organizes  processes  into  a  directed  cycle.  Suppose  process  Pk  receives  a  set  i 
of  mutually  concurrent  nodes  from  Pi  through  Pk-i.  Process  Pk  is  supposed  to  add  one  of  its  own 
nodes  to  form  mutually  concurrent  set  Sk',  however,  process  Pk  may  act  instead  with  malice: 

•  Backdating  Process  Pk  could  forge  a  backdated  timestamp  for  some  node  A,  and  conse¬ 
quently  include  A  in  Sk  even  if  A  follows  some  B  e  Sk-\. 
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•  Postdating  Likewise,  process  Pk  could  foi^e  a  postdated  timestamp  for  some  node  A, 
and  consequently  include  A  in  Sk  even  if  A  precedes  some  B  €  Sk-\. 

This  protocol  gives  each  process  freedom  in  selecting  concurrent  nodes.  This  freedom  makes 
concurrency  detection  less  robust  against  attack.  The  Signed  Vector  Timestamp  protocol  does  not 
help:  a  malicious  process  Pk  can  select  arbitrary  signed  entries  from  the  timestamps  on  the  nodes 
in  set  Sk-i- 

The  Reduced  Round  Robin  snapshot  protocols  of  Section  3.2.2  achieve  better  performance 
than  this  basic  protocol;  this  improvement  exploits  shortcuts;  using  concurrency  information  that 
timestamp  and  rollback  vectors  already  contain.  These  shortcuts  sometimes  make  concurrency 
detection  more  resilient.  For  example,  suppose  a  malicious  process  p  fraudulently  wishes  to  insert 
a  node  A  into  a  snapshot  obtained  from  the  adjusted  timestamp  vector  V*(B)  of  node  B  at  process 
q.  Since  process  q  already  “knows”  the  identity  of  A  (from  the  timestamp  vector  for  B),  process 
p  must  manipulate  vectors  not  only  before  q  asks  for  the  snapshot,  but  also  before  B  even  occurs. 
Process  p  must  forge  the  right  sequence  of  outgoing  messages — and  must  hope  that  other  processes 
do  not  send  messages  that  dispel  the  illusion  that  the  node  preceding  A  at  p  is  the  p-maximal  node 
preceding  B.  On  the  other  hand,  taking  a  snapshot  using  an  adjusted  rollback  vector  R*(^)  does 
not  provide  as  much  resilience,  since  the  potential  delay  between  B  and  R*(S)  gives  malicious 
processes  more  flexibilit3^ 

Attacking  Descriptions  Taking  a  snapshot  usually  entails  more  than  just  collecting  a  set  of 
mutually  concurrent  timestamps;  we  also  want  a  description  of  the  activity  associated  with  each 
of  these  timestamps.  This  requirement  creates  another  avenue  of  attack:  a  malicious  process  may 
attack  a  snapshot  protocol  by  using  legitimate  timestamps  but  lying  about  the  nodes  and  activity  that 
belong  to  the  timestamps.  For  example,  in  the  Reduced  Round  Robin  snapshot  protocol,  an  honest 
process  q  might  ask  a  malicious  process  p  for  the  node  following  the  one  names  by  the  p  entry  in 
V(B),  for  a  node  B  at  process  q.  Protocols  such  as  Signed  Vectors  keep  the  timestamp  separate 
from  the  node  name — so  process  p  can  reply  to  q  with  the  proper  timestamp  for  the  requested  node, 
but  may  forge  the  name  and  description  of  the  node  itself. 

Defenses  Protecting  against  these  attacks  using  the  Sealed  Vector  Timestamp  protocol  is 
straightforward.  Sealed  Vectors  protect  against  foiging  timestamps  and  subverting  concurrency 
detection;  the  presence  of  a  trusted  agent  (the  secure  coprocessor)  to  link  timestamps  to  node  names 
protects  against  description  attacks. 

Effectively  protecting  against  these  attacks  without  using  secure  coprocessors  remains  a  re¬ 
search  area.  Expanding  Signed  Vectors  to  include  more  details  of  message  paths  might  make  them 
harder  to  forge.  However,  we  still  have  the  problem  that  (in  terms  of  Section  6.2.1)  the  ability  to 
assemble  legitimate  timestamps  easily  transforms  into  the  ability  fraudulent  timestamps.  Rather 
than  using  the  correct  set  of  local  tokens,  a  malicious  process  may  use  a  carefully  chosen  incorrect 
set. 


The  techniques  of  Haber  and  Stornetta  [HaSt9 1 ,  BHS931  provide  some  grounds  for  future  work. 
Cryptographic  linking  techniques  might  prevent  node-name  attacks  (honest  processes  can  prove 
their  allegation  that  a  given  node  follows  node  A),  but  using  these  techniques  requires  forcing 
processes  to  exchange  i.  n'ect  logging  information.  This  exchange  may  be  difficult  to  ensure 
without  the  trusted  local  agent  of  a  secure  coprocessor.  Pseudorandom  logging  techniques  may 
be  more  effective  in  these  situations — but  at  the  expense  of  increased  communication  and  delayed 
verification,  and  also  with  the  increased  risk  of  espionage  and  sabotage  that  come  with  remote 
logging. 


6.3.2.  Passive  Attacks 

Snapshot  protoccl;.  based  on  distributed  time  also  permit  passive  attacks — both  the  standard  time- 
stamp  vector  attacks,  and  new  ones  raised  by  the  snapshot  problem. 


Privacy  Distributed  snapshot  protocols  based  on  timestamp  vectors  inherit  their  privacy  risks: 
vectors  leak  information.  Problems  also  arise  from  observation  effects: '  the  interaction  between 
the  act  of  observing  and  the  computation  being  observed.  Do  the  messages  exchanged  as  part  of 
taking  a  snapshot  of  a  given  computation  belong  to  the  computation?  If  not,  then  the  data  being 
exchanged  creates  serious  potential  for  abuse.  Participating  in  such  a  snapshot  protocol  provides 
processes  with  valid  local  tokens  for  nodes  on  which  they  have  no  precedence;  a  malicious  process 
might  use  these  tokens  to  forge  timestamps.  For  example,  using  the  Signed  Vector  Timestamp 
protocol  here  would  distribute  signed  vector  entries  to  processes  that  have  no  dependence  on  the 
nodes  named  by  those  entries. 


Spying  on  the  Initiator  The  preceding  attacks  come  from  spying  on  the  data  exchanged  as 
part  of  a  snapshot  protocol.  A  malicious  process  may  also  gain  unauthorized  information  from  the 
fact  that  a  snapshot  protocol  is  being  executed.  For  example,  suppose  auditor  Alice  is  asking  for  a 
snapshot  to  verify  that  the  electronic  currency  in  circulation  sums  correctly.  If  counterfeiter  Bad 
Bob  knows  this  fact,  then  he  may  manipulate  this  probe  to  hide  his  crime  (and  subvert  the  purpose 
of  Alice  taking  this  snapshot). 


Spying  on  Other  Processes  A  distributed  snapshot  protocol  may  also  be  misused  by  its 
initiator  to  gain  unauthorized  information  about  other  processes  in  the  system.  Of  course,  an 
authorization  policy  must  exist  for  an  action  to  be  classified  as  misuse.  If  anyone  is  permitted  to 
take  any  kind  of  snapshot  at  any  time,  then  subversion  is  not  necessary.  However,  more  substantive 
authorization  rules  create  the  potential  for  both  direct  and  indirect  attacks.  A  malicious  process 
p  might  forge  its  own  authorization  proving  the  legitimacy  of  its  snapshot  request;  alternatively. 


'Section  3.4.2  discussed  this  issue. 
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a  malicious  process  p  might  spy  on  process  q  by  simulating  participation  in  a  legitimate  snapshot 
protocol  initiated  by  a  different  process. 


Defonses  One  way  to  add  resiliency  to  a  snapshot  protocol  is  to  require  each  participant  to 
identify  the  initiator.  However,  strengthening  a  protocol  by  including  extra  information  suggests 
a  fundamental  tradeoff  between  privacy  and  security:  information  included  is  information  leaked. 
As  with  the  active  attacks,  using  secure  coprocessors  appears  to  be  the  best  defense.  We  can  seal 
the  entire  snapshot  protocol,  and  also  use  secure  coprocessors  to  ensure  the  initial  requests  for 
snapshots  agree  with  whatever  policy  we  select  for  snapshot  authorization. 


6.3.3.  Alternative  Models 


Chapter  3  introduces  another  approach  to  obtaining  global  states  satisfying  some  particular  prop¬ 
erty:  taking  a  standard  snapshot  from  a  nonstandard  time  model.  Chapter  4  shows  how  at  least 
three  distinct  virtual  partial  orders  arise  from  rollback  with  modified  replay;  a  process  may  also 
wish  to  take  a  snapshot  from  one  these  alternative  models. 

This  approach  to  snapshots  follows  directly  from  the  orthogonality  between  clocks  and  higher- 
level  time  protocols.  However,  the  performance  orthogonality  between  clocks  and  protocols  does 
not  extend  to  a  security  orthogonality  between  clocks  and  models.  As  Section  6.2  discussed,  how 
a  temporal  relation  in  an  abstract  time  model  arises  from  the  real-time  partial  order  (e.g.,  is  it 
flow-virtual?)  influences  how  its  clocks  may  be  attacked. 


Blocked  Partial  Order  Time  Theorem  3.5  from  Section  3.3.2  repeats  a  result  from  [Sm93]: 
each  timeslice  from  a  Type  2  (consistent  and  independent)  parallel  pair  has  a  unique  subset  of 
nodes  that  determine  the  timeslice.  Since  these  subsets  are  partial  timeslices  from  the  composition 
of  the  BLOCKED  model  with  the  partial  order,  taking  snapshots  in  this  higher-level  model  yields  an 
exponential  number  of  snapshots  in  the  original  partial  order.  This  technique  creates  the  potential 
for  security  and  privacy  problems,  because  of  the  BLOCKED  model  itself,  and  because  we  have  two 
levels  of  time. 

One  problem  arises  because  of  the  lack  of  view-completeness.  Applying  blocked  to  a  Type  2 
parallel  pair  (M,M')  does  not  preserve  all  properties  of  (M,M');  in  particular,  we  lose  view- 
completeness  (as  Section  3.3.2  observes).  This  adds  a  wrinkle  to  the  Round  Robin  protocol:  a 
process  may  not  have  ai.y  node  to  add  to  the  partial  timeslice.  This  wrinkle  leads  again  to  a 
security-privacy  tradeoff:  if  we  do  not  require  such  a  process  to  provide  proof  of  its  necessary 
abstention,  then  we  allow  malicious  processes  to  opt  out  of  reporting  sensitive  data.  The  role 
of  secure  coprocessors  as  trusted  oracles  keeps  this  from  being  a  problem  for  the  Sealed  Vector 
Timestamp  protocol. 
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Flow-support  is  another  property  that  the  blocked  model  does  not  preserve.  Consider  a 
message  edge  S  — >  Rina,  graph  from  the  partial  .order  .time  model.  Composing  blocked 
with  this  model  draws  an  edge  to  R  from  the  local  successor  of  5,  but  an  information  flow  path 
does  not  exist.  This  is  not  a  serious  problem:  the  sending  process  can  inform  the  receiving  process 
of  the  identiflcation  of  the  next  local  node.  For  the  higher-level  model,  the  foundation  of  the 
Signed  Vector  Timestamp  protocol  still  holds:  possession  of  a  signature  for  an  honest  node  proves 
dependence  on  that  node.  We  may  thus  use  either  Signed  Vectors  or  Sealed  Vectors  to  track 
BLOCKED  0  PARTIAL-ORDER-TIME  relations. 

The  fact  that  two  distinct  levels  of  time  are  being  tracked  also  permits  level-mixing  at¬ 
tacks.  Consider  again  the  example  of  partial  .order  .time.  The  partial  .  order  _  time  and 
BLOCKED  o  PARTIAL. ORDER. TIME  models  describe  sufficiently  similar  structures  that  privacy  is 
not  a  problem.  However,  security  risks  might  still  exist:  for  example,  with  Signed  Vectors, 
timestamps  for  one  level  could  be  used  to  construct  timestamps  for  the  other.  As  Section  6.2.3 
observed,  the  use  of  Signed  Vectors  with  multiple  levels  requires  either  distinct  signature  func¬ 
tions  or  distinct  name  spaces.  (Otherwise,  our  trick  for  having  possession  imply  precedence  in 
BLOCKED  o  PARTIAL.  ORDER.  TIME  causes  this  property  to  fail  for  partial.  ORDER  .time.) 


Strong  Partial  Order  Time  We  developed  the  strong  model  to  allow  processes  to  take 
snapshots  in  which  no  messages  are  in  transit.  The  strong  model  composes  with  a  partial  order 
by  making  message  edges  bidirectional;  the  resulting  temporal  relation  has  the  property  that  its 
timeslices  are  exactly  the  timeslices  from  the  original  partial  order  in  which  no  messages  were  in 
transit. 

The  STRONG  model  also  alters  the  properties  of  the  model  to  which  it  is  applied.  For  example, 
STRONG  o  PARTIAL -ORDER. TIME  differs  from  Standard  partial  order  time  in  some  substantial  ways: 
edges  may  flow  backwards  in  time,  and  precedence  no  longer  implies  information  flow.  The  most 
substantial  difference  is  that  the  relation  possesses  cycles.  Making  message  edges  bidirectional 
ties  together  send  events  and  receive  events;  sets  of  messages  may  interact  in  unexpected  ways  to 
form  larger  cycles. 

The  cycles  in  the  STRONG  o  PARTIAL  .ORDER  .TIME  model  create  opportunities  for  malicious 
processes  to  attack  clock  and  snapshot  protocols.  As  Section  6.2.1  discussed,  clocks  must  keep 
track  of  the  incoming  edges  with  unknown  originating  nodes;  clocks  that  know  the  identity  of  these 
nodes  must  see  that  this  information  eventually  reaches  the  clocks  that  need  it.  Without  secure 
coprocessors  to  keep  them  honest,  malicious  processes  can  lie  on  both  ends  of  this  task,  and  spy 
on  the  information  itself. 

Malicious  processes  can  also  subvert  the  model  without  attacking  the  clocks  by  making  sure 
that  at  least  one  message  is  always  in  transit.  Figure  6.2  sketches  this  scenario. 
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Figure  6.2  Malicious  processes  may  subvert  the  strong  .partial. order  model 
by  ensuring  that  at  least  one  message  is  always  in  transit.  This  partial  .  order  _  time 
graph  illustrates  the  initial  phases  of  such  a  conspiracy  between  Bad  Bob  and 
Crooked  Cathy.  For  each  i,  Si  — >  Ri  and  S\  — >  R\  in  the  timelines.  However,  the 
STRONG  model  makes  message  edges  bidirectional,  so  applying  that  would  make 
i?,  — ►  S[  and  R\  — >  5..  Hence,  in  strong . partial  . order,  each  Bad  Bob  node 
from  5,  to  Ri  (inclusive)  is  cyclic,  and  cannot  be  part  of  a  timeslice.  Bad  Bob  and 
Crooked  Cathy  collaborate  to  ensure  that  any  Bad  Bob  node  from  S\  on  can  never 
be  part  of  a  strong  .partial  .order  timeslice. 


6.4.  Optimistic  Roiiback  Recovery 

The  optimistic  rollback  recovery  protocol  of  Chapter  4  uses  distributed  time  clocks,  and  thus  is 
liable  to  security  and  privacy  attacks  on  the  clock  mechanisms.  Section  6.4. 1  considers  standard 
attacks  on  clocks,  and  Section  6.4.2  considers  some  attacks  more  specific  to  optimistic  rollback 
recovery.  Many  of  these  issues  are  also  relevant  to  previous  rollback  protocols;  however,  by  its 
explicit  foundation  in  two  levels  of  partial  order  time,  our  protocol  is  a  particularly  appropriate 
scenario  to  discuss  these  issues. 


6.4.1 .  Standard  Attacks 

Chapter  5  discussed  three  risks  of  partial  order  clocks:  backdating,  postdating,  and  privacy  leaks. 
Section  6.2.2  discussed  the  additional  problems  of  level-mixing  (that  arises  when  an  application 
tracks  multiple  levels  of  time)  and  branch-mixing  (that  arises  when  an  application  deals  with  a 
nonlinear  pair).  These  risks  all  apply  to  our  optimistic  rollback  recovery  protocol,  which  uses  two 
levels  of  time:  a  nonlinear  pair  to  track  dependence  on  failed  nodes,  and  a  parallel  pair  to  track 
knowledge  of  rollback. 

The  SYSTEM.  PARTIAL.  ORDER  model  tracking  knowledge  of  rollbacks  is  a  standard  partial  order 
model,  producing  the  partial  order  that  an  external  observer  (unaware  that  recovery  is  taking  place) 
would  perceive.  The  standard  backdating,  postdating,  and  privacy  risks  apply. 
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In  this  context,  backdating  hides  knowledge  of  rollbacks;  the  malicious  process  falsely  deludes 
an  honest  process  into  thinking  a  node  is  not  an  orphan.  This  hoax  may  have  direct  consequences, 
such  as  the  honest  process  failing  to  roll  itself  back  or  to  discard  an  incoming  orphan  message,  or 
may  have  more  subtle  consequences,  such  undeceived  honest  processes  rejecting  messages  from 
the  deceived  honest  process.  Figure  6.3  shows  an  example  of  the  first  scenario;  Figure  6.4  shows 
an  example  of  the  second. 

Postdating  a  node  in  the  system -PARTIAL -ORDER  order  involves  forging  the  future.  Placing  a 
user  node  artificially  far  in  the  SYSTEM -PARTIAL -ORDER  future  allows  a  malicious  process  to  fool 
an  honest  one  into  accepting  orphan  messages.  Figure  6.5  shows  an  example.  The  Signed  Vector 
Timestamp  protocol  does  not  solve  these  problems,  but  the  Sealed  Vector  Timestamp  protocol  does. 

In  the  USER -PARTIAL -ORDER  model  tracking  dependence  on  failed  nodes,  backdating  and 
postdating  have  the  more  standard  behavior  of  hiding  or  forging  dependence  on  failed  nodes.  This 
model  behaves  like  the  standard  partial  order  until  rollback  actually  occurs. 


Branch-Mixing  Once  user  timelines  turn  into  timetrees,  the  user  _  partial  _  order  graph  may 
generate  a  valid  failure -FREE -Partial -ORDER  graph.  However,  the  fact  that  failure-free  graph 
is  flow-virtual  complicates  the  task  of  reliably  tracking  it.  As  Section  6.2.3  observed  and  Figure  6. 1 


Figure  6.3  Backdating  system -partial -order  relations  can  cause  honest 
processes  to  waste  computation.  In  this  example,  Alice’s  rollback  makes  Cathy 
an  orphan.  By  backdating  the  system  .partial -order  vector  on  his  first  message 
to  Cathy,  Bad  Bob  prevents  Cathy  from  learning  that  she  is  an  orphan  until  after  she 
has  performed  expensive  computation  that  now  must  be  discarded. 
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Rgure  6.4  If  knowledge  Ci  rollback  is  propagated  solely  on  system  messages 
carrying  user  messages,  then  backdating  system  .partial .order  can  cause  an 
honest  process  to  remain  an  orphan  indefinitely.  In  this  example,  Alice’s  rollback 
makes  Cathy  an  orphan.  By  backdating  his  system  .partial,  order  vector.  Bad  Bob 
prevents  Cathy  from  learning  this  fact.  All  of  her  subsequent  user  messages  will  be 
rejected — Cathy  loses  all  credibility  with  Doug, 
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ngura  6.5  Postdating  system,  partial  .order  relations  foois  honest  processes 
into  accepting  orphan  messages,  in  this  example,  Alice’s  rollback  makes  Cathy  an 
orphan.  By  advancing  his  Alice  entry  in  the  system  model  but  not  in  the  user  model, 
Bad  Bob  not  only  hides  the  rollback  from  Doug,  he  ensures  that  Doug  will  not  listen 
to  anyone  else’s  announcement  of  the  rollback. 
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illustrated,  the  Signed  Vector  Timestamp  protocol  breaks  down  when  the  a  graph  is  generated 
virtually — possession  of  a  signed  entry  for  a  node  no  longer  implies  dependence  on  that  node. 

The  Sealed  Vector  Tmestamp  protocol  still  provides  protection  in  this  scenario. 


Level-Mixing  One  way  to  subvert  a  protocol  that  requires  accurate  tracking  of  computation  on 
two  levels  is  to  disrupt  the  correspondence  between  the  levels.  For  example,  the  rollback  protocol 
from  Ch^ter  4  requires  that  processes  be  able  to  map  two  system  nodes  at  another  process  to  their 
corresponding  user  nodes,  and  be  able  to  sort  them  in  terms  of  the  USER  .partial  .order  model. 
How  can  this  mapping  be  made  reliable?  If  it  is  each  process’s  responsibility  to  report  a  node  as  a 
pair  of  identifiers,  then  malicious  processes  can  avoid  the  problem  of  forging  timestamps  merely 
by  mismatching  valid  timestamps. 

Again,  the  Sealed  Vector  Timestamp  protocol  still  provides  protection  in  this  scenario. 

Privacy  Risks  Surviving  processes  may  need  to  roll  back  in  response  to  a  failure.  If  a  surviving 
process  is  malicious,  it  may  retain  and  exploit  old  state.  For  example,  one  process  in  a  poker  game 
may  mistakenly  reveal  a  card,  and  call  for  rollback.  How  do  we  ensure  the  other  processes  actually 
“forget”  the  revealed  card? 

Banking  systems  provide  another  example.  Suppose  Alice  deposits  a  large  check  for  Bad  Bob 
with  banker  Cathy.  Alice  then  discovers  that  all  her  activity  that  day  was  incorrect,  and  rolls  herself 
back.  The  current  state  at  Cathy  indicates  that  a  large  sum  of  money  is  in  Bad  Bob's  account — but 
Alice's  failure  makes  this  state  an  orphan.  If  Bad  Bob  learns  that  Cathy  is  an  orphan  before  Cathy 
does,  then  Bad  Bob  can  exploit  the  incorrect  balance  by  withdrawing  the  extra  money. 

To  solve  this  problem,  we  need  to  introduce  a  complete  discontinuity  in  the  state  of  surviving 
processes  that  roll  back.  Perhaps  we  could  force  a  site  migration,  and  keep  the  location  of  the  new 
site  secret  from  the  old  site.  We  may  need  to  extend  this  discontinuity  to  any  process  learning 
of  rollback;  the  banking  example  did  not  specify  whether  Bad  Bob’s  own  state  was  an  orphan. 
This  problem  raises  similar  issues  as  commitment,  since  transfer  of  knowledge  is  an  action  that  is 
difficult  to  undo. 


6.4.2.  Other  Avenues  of  Attack 

Our  framework  of  secure  distributed  time  provides  protection  against  clock-based  security  and 
privacy  risks.  However,  optimistic  rollback  recovery  protocols  face  other  security  risks.  In  this 
section  consider,  we  discuss  some  of  these  areas  for  future  research. 


Checkpointing  Rollback  protocols  assume  some  mechanism  for  processes  to  restore  state. 
Usually  this  mechanism  uses  stable  storage  to  preserve  sufficient  information  for  state  restoration. 
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This  information  may  consist  of  checkpointed  images  of  local  state,  logs  of  incoming  messages 
(for  replay),  logs  of  outgoing  messages  (for  replay),  or  various  combinations  of  these  techniques. 

The  existence  and  use  of  this  logged  information  creates  security  and  privacy  risks: 


•  Forging  Identity  A  malicious  process  can  forge  someone  else’s  checkpoint. 

•  Forging  Data  A  malicious  process  can  lie  about  the  data  it  stores  as  its  own  checkpoint. 

•  Forging  Timestamps  In  protocols  that  preserve  more  than  just  the  most  recent  checkpoint 
at  each  process,  a  malicious  process  can  attack  the  methods  used  to  identify  which  checkpoint 
belongs  to  what  point  in  (distributed)  time. 

•  Forging  Storage  Location  If  stable  storage  servers  are  distributed  throughout  the 
system,  a  method  must  exist  that,  upon  recovery  of  process  q,  specifies  where  the  checkpoint 
for  q  is  saved.  A  malicious  process  can  disrupt  recovery  by  leaving  q's  checkpoints  untouched 
and  attacking  this  mapping  instead. 

•  Espionage  A  malicious  process  p  might  gain  unauthorized  knowledge  about  the  affairs 
of  process  q  by  examining  checkpoints  belonging  to  q. 

•  Interactions  The  checkpointing  policy  at  a  physical  process  site  cannot  be  completely 
orthogonal  to  the  levels  of  processes  at  that  site.  For  example,  in  a  system  using  Signed 
Vectors,  checkpointing  a  physical  process  would  leak  keys  belonging  to  the  system-level 
processes  (since  the  checkpoint  would  include  these  keys). 

•  Authority  The  authority  to  read  a  checkpoint  belonging  to  a  process  q  must  be  more  than 
just  the  identity  of  q  or  its  physical  site,  since  both  these  may  vanish  in  a  failure. 


Restart  As  the  last  item  above  suggests,  the  mechanics  of  restart — especially  in  the  face  of 
failure  of  physical  machines — creates  risks: 

•  Proving  Legitimacy  of  Request  How  does  a  system  establish  the  legitimacy  of  a 
restart  request?  For  example,  consider  the  standard  mechanism  of  a  process  p  calling  for 
a  restart  of  process  q  if  process  q  has  been  silent  for  a  while.  (After  all,  indefinite  silence 
is  indistinguishable  from  failure.)  A  process  r  that  has  heard  from  q  recently  can  veto  this 
request.  Once  more,  we  face  a  tradeoff  between  security  and  privacy:  preventing  malicious 
vetoes  requires  that  process  r  present  evidence  (e.g.,  a  timestamp)  showing  it  has  heard  from 
q — which  allows  process  p  to  probe  the  behavior  of  processes  r  and  q  by  “innocently”  calling 
for  restart. 

•  Malicious  Restart  A  malicious  process  might  be  able  to  abuse  the  restart  mechanism  by 
convincing  a  sufficient  quorum  of  processes  that  an  honest,  non-fauity  process  q  is  dead. 
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•  Malicious  Termination  Even  with  no  malice,  two  versions  of  the  same  process  may  be 
alive  simultaneously  because  silence  and  failure  are  indistinguishable.  Rollback  implemen¬ 
tations  thus  must  include  a  way  to  terminate  honest  processes.  A  malicious  process  might  be 
able  to  abuse  this  machinery  and  terminate  inconvenient  honest  processes. 

•  Directing  Migration  to  a  Corrupted  Site  When  an  honest  process  q  is  restarted  (either 
naturally,  or  through  malice),  a  corrupt  process  p  might  be  able  to  direct  the  restarted  version 
to  a  physical  site  that  p  has  compromised — thus  gaining  access  to  data  and  authority  of 
process  q. 

•  Mutual  Restart  Even  if  a  majority  of  processes  must  agree  to  restart  a  silent  process,  a 
coterie  of  malicious  processes  could  partition  the  honest  processes  and  convince  each  half  to 
restart  the  other. 

•  Migration  of  Authority  Security  and  privacy  techniques  (such  as  the  Signed  Vector 
Timestamp  protocol  and  the  Sealed  Vector  Timestamp  protoc  ^1)  may  assume  that  each  honest 
process  possesses  secret  keys.  Realistic  rollback  protocols  allow  for  a  process  to  migrate  to 
a  new  physical  site  when  the  original  physical  site  fails.  How  does  the  new  version  of  the 
process  obtain  the  proper  keys?  If  backup  copies  of  the  keys  exist — even  in  a  shared  fashion 
[Sh79] — what  protects  them?  If  new  keys  are  created,  what  prevents  a  malicious  process 
from  inventing  and  inserting  new  keys? 

•  Revoking  Authority  If  a  site  is  compromised,  how  do  the  surviving  honest  processes 
revoke  its  authority?  Can  the  revocation  mechanism  be  used  to  attack  honest  processes? 

•  Migration  of  Identity  When  a  process  migrates  to  a  new  site,  how  does  it  convince  other 
processes  of  its  new  identity?  Can  this  mechanism  be  abused  to  steal  the  identity  of  an  honest 
process? 


Rollback  Even  if  we  take  care  of  these  attacks  on  a  rollback  recovery  protocol,  the  protocol 
can  still  be  subverted  simply  by  misusing  it.  For  example,  a  malicious  process  can  prevent  the 
entire  system  from  ever  getting  any  work  done  simply  by  repeatedly  sending  messages  to  honest 
processes  (establishing  dependence)  and  then  rolling  itself  back. 
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Chapter  7 
Conclusion 


7.1.  Summary 

Distributed  time  provides  a  general  framework  for  building  distributed  protocols  and  for  transpar¬ 
ently  adding  security  and  privacy  protection  to  these  protocols.  This  thesis  demonstrated  these 
claims  in  three  steps: 

•  We  developed  formal  machinery  to  express  the  general  temporal  relations  that  arise  in  dis¬ 
tributed  application  problems.  We  then  built  a  suite  of  clock  primitives  for  these  relations. 

•  We  analyzed  application  problems  in  terms  of  these  general  temporal  relations.  We  then 
built  protocol  solutions  using  the  clock  primitives.  The  orthogonality  between  clocks  and 
protocols  allows  transparently  modifying  the  protocols  by  changing  clocks  and  time  models. 

•  We  identified  the  security  and  privacy  risks  inherent  to  tracking  general  temporal  relations, 
and  built  clock  primitives  that  protect  against  these  risks.  We  then  provided  security  to  our 
higher-level  protocols  by  transparently  substituting  these  secure  clocks. 

By  providing  insight  into  the  underlying  temporal  relations  and  orthogonality  between  clocks 
and  protocols,  the  distributed  time  framework  permits  us  to  build  protocols  that  are  more  general, 
more  flexible,  and  more  secure  than  previous  solutions.  Furthermore,  the  security  and  privacy 
problems  we  identify — and  the  solutions  we  provide — also  apply  to  less  general  frameworks. 

Computational  environments  are  becoming  increasingly  distributed,  and  applications  are  per¬ 
meating  social  and  financial  arenas  that  are  particularly  sensitive  to  security  and  privacy  attacks. 
The  problems  that  this  framework  addresses  are  likewise  becoming  increasingly  important. 


7.1 .1 .  Distributed  Time 

Distributed  time  improves  on  previous  work  in  partial  order  time  by  providing  a  fully  general 
temporal  framework  supporting  protocol  design  and  construction. 


We  defined  a  computation  graph  format  to  describe  computation,  and  showed  how  to  translate 
system  traces  into  ground-level  computation  graphs.  We  then  defined  time  models  as  representa¬ 
tional  transf(Mtnations  of  computation  graphs,  and  constructed  a  suite  of  clock  primitives  probing 
relations  in  these  models. 

Computation  graphs  allow  us  to  consider  temporal  relations  more  general  than  partial  orders — 
for  example,  non-transitive  relations  and  cyclic  relations.  Time  models  provide  a  formal  means  to 
abstract  away  irrelevant  temporal,  physical,  and  computational  detail.  The  ability  to  compose  time 
models  permits  us  to  build  hierarchies  of  temporal  abstraction.  The  separation  between  time  models 
and  computation  graphs  allows  us  to  consider  computations  that  arise  virtually,  via  composition  of 
models. 


7.1.2.  Distributed  Protocols 


Distributed  time  supports  protocol  construction  by  providing  an  understanding  of  the  general 
temporal  relations  underlying  ^plication  problems,  and  by  allowing  processes  to  examine  these 
relations  via  clock  primitives.  This  thesis  illustrates  this  support  by  applying  the  framework  of 
distributed  time  to  three  application  problems:  detecting  potential  causality,  and  the  more  advanced 
examples  of  distributed  snapshots  and  optimistic  rollback  recovery. 

Distributed  time  permits  accurate  detection  of  potential  causality  in  asynchronous  distributed 
systems.  Determining  whether  one  event  potentially  influenced  another  reduces  to  querying  a  clock 
primitive.  The  orthogonality  in  our  framework  permits  transparent  extension  of  protocols  using 
these  clock  queries.  For  example,  changing  the  time  model  used  in  the  clock  primitives  permits 
departing  from  real-time  partial  orders,  allowing  the  detection  of  potential  causality  in  a  distributed 
computation  that  (perhaps  via  rollback  and  modified  replay)  never  physically  occurred. 

By  expressing  temporal  and  computational  abstractio. ,  distributed  time  provides  a  framework 
for  taking  distributed  snapshots.  A  timeslice  in  a  well-constructed  time  model  represents  an 
instantaneous  global  state  of  the  system  in  some  underlying  computation;  clock  primitives  directly 
support  assembly  of  timeslices.  This  framework  permits  increased  flexibility.  For  example,  we 
can  take  snapshots  of  the  past,  and  (by  using  higher-level  time  models)  we  can  take  snapshots  with 
specific  properties. 

By  expressing  multiple  levels  of  temporal  and  computational  abstraction,  distributed  time  pro¬ 
vides  a  framework  for  optimistic  rollback  recovery.  This  problem  involves  two  distinct  distributed 
computations:  the  user  application  level  and  the  system  recovery  level.  The  ability  to  model  both 
levels  permits  us  to  build  an  optimistic  rollback  recovery  protocol  that  allows  processes  to  fully 
exploit  all  potential  information.  Our  new  optimistic  rollback  recovery  protocol  is  the  first  to 
provide  both  fully  asynchronous  recovery  and  optimality  in  the  number  of  individual  rollbacks  at 
processes.  In  particular,  we  reduce  the  previous  worst  case  for  asynchronous  optimistic  rollback 
recovery  from  exponential  to  at  most  one  rollback  per  process  after  any  failure. 
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7.1.3.  Security  and  Privacy 


Tracking  temporal  relations  more  general  than  real  time  creates  security  and  privacy  risks.  This 
thesis  identified  these  risks  and  constructed  clock  primitives  that  protect  against  them.  Because 
our  framework  provides  orthogonality  between  clocks  and  protocols,  using  secure  clocks  can 
transparently  provide  security  for  higher-level  protocols. 

Unlike  the  passage  of  real  time,  the  general  temporal  relations  of  distributed  time  cannot  be 
verified  independently.  Processes  must  share  private  information,  and  must  trust  the  information 
that  is  shared  with  them.  This  necessity  of  trust  creates  the  potential  for  security  and  privacy  risks 
for  clocks  for  these  relations:  malicious  processes  may  sabotage  the  clocks  at  honest  processes  by 
providing  false  information,  and  may  spy  on  honest  processes  by  misusing  the  information  that 
honest  processes  provide.  These  risks  for  clocks  translate  to  risks  for  protocols  based  on  these 
clocks — such  as  the  application  protocols  presented  in  this  thesis,  or  other  protocols  based  on 
querying  temporal  relations  such  as  partial  order  tin^. 

The  proposal  document  for  this  thesis  opened  these  questions  and  presented  the  Signed  Vector 
Timestamp  protocol,  the  first  to  provide  security  for  partial  order  time  clocks.  This  thesis  used 
cryptogr^hic  and  secure  coprocessor  techniques  to  develop  the  Sealed  Vector  Timestamp  protocol 
that  provides  full  security  and  privacy  for  time  models  more  general  than  the  standard  real-time 
partial  order.  The  generality  and  security  of  these  techniques  provides  security  and  privacy  pro¬ 
tection  for  higher-level  protocols  built  on  these  clocks.  For  example,  we  can  provide  immediate 
ordered  service,  take  distributed  snapshots,  and  recover  from  failure — while  also  protecting  against 
espionage  and  Byzantine  attacks. 


7.1 .4.  A  Single  Arena  for  Time  and  Security 


Previous  work  has  used  partial  order  time  to  analyze  distributed  computation  and  to  construct 
distributed  protocols.  However,  many  distributed  applications  center  on  temporal  relations  more 
general  than  a  single  level  of  a  partial  order.  Furthermore,  separate  applications  may  center  on 
temporal  relations  that  are  separate  but  related. 

The  time  hierarchies  and  secure  clocks  developed  in  this  thesis  provide  a  single  framework  to 
consider  these  separate  issues.  This  framework  allows  us  to  integrate  applications  and  solutions 
developed  independently.  For  example,  rollback  with  modified  replay  changes  the  underlying 
computation.  By  formally  specifying  how  the  rollback  protocol  changes  the  virtual  partial  order 
computation,  and  by  writing  snapshot  protocols  in  terms  of  queries  to  clocks  for  a  specific  partial 
order  time  model,  the  distributed  time  framework  lets  us  take  snapshots  without  worrying  about 
rollback.  Furthermore,  rollback  creates  several  layers  of  partial  order  time;  we  can  use  distributed¬ 
time  based  snapshot  protocols  to  take  snapshots  of  each  layer.  This  framework  also  allows  us 
to  consider  the  security  and  privacy  issues  for  these  various  levels  of  time,  independently  of  the 
particular  applications  and  protocols. 
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7.2.  Future  Work 


Future  research  in  secure  distributed  time  includes  developing  and  testing  new  techniques  for  secure 
clocks,  and  using  this  framework  to  solve  the  time  and  security  problems  in  new  application  areas. 
Section  7.2.1  considers  some  areas  of  work  in  clock  techniques,  and  Section  7.2.2  discusses  some 
particular  ^plication  problems.  However,  the  shape  and  scope  of  computation  is  changing,  and  the 
questions  of  security  and  privacy  are  becoming  more  urgent  and  harder  to  specify.  Section  7.2.3 
offers  some  speculation  on  the  fundamental  role  that  the  secure  distributed  time  framework  may 
play  in  this  emerging  world. 


7.2.1.  Future  Work:  Techniques 


Discussions  of  new  clock  and  security  techniques  and  areas  for  future  research  have  occurred 
throughout  this  thesis.  We  summarize  some  of  them  here. 

One  of  the  principal  drawbacks  of  any  of  the  timestamp  vector  techniques  is  vector  size:  a 
vector  has  one  entry  for  each  process  in  the  system,  and  these  entries  may  even  have  nonconstant 
size.  As  Section  5.5  discussed,  this  property  makes  scalability  a  significant  concern:  how  can  these 
techniques  extend  to  large  networks?  Avenues  to  solve  this  problem  include  adapting  cryptographic 
linking  and  distributed  trust  techniques  [HaSt91,  BHS93],  exploiting  secure  logging  sites,  and 
developing  good  heuristics  for  when  information  can  be  omitted.  The  discussion  of  timestamp 
clocks  in  Section  6.2.1  raises  an  additional  question:  do  effective  clock  techniques  exist  that  depart 
from  the  timestamp  approach  in  any  substantial  way? 

Some  areas  for  security  research  include  limiting  potential  damage  when  secure  coprocessors 
are  compromised  (for  example,  fleshing  out  the  Give-and-Forget  approach  of  Section  5.5),  de¬ 
tecting  covert  communication  between  malicious  processes,  and  exploring  what  privacy  can  be 
attained  without  coprocessors.  Another  area  for  work  is  formulating  effective  privacy  policies  for 
coprocessor-based  clocks  such  as  Sealed  Vectors.  Op  a  basic  level,  we  need  to  develop  formal 
(and  enforceable)  rules  for  precedence  querying.  How  does  process  p  limit  the  use  of  timestamps 
it  generates?  How  will  this  policy  limit  what  a  malicious  process  may  learn  via  selective  probing? 
On  a  more  advanced  level,  we  need  to  develop  policies  for  snapshots  and  rollback  that  grant  the 
initiator  some  degree  of  anonymity  while  also  establishing  the  initiator’s  authority. 

These  issues  all  hint  at  a  fundamental  tradeoff  between  security  and  privacy.  Including  sufficient 
data  in  protocols  to  prevent  malicious  tampering  creates  privacy  as  well  as  efficiency  concerns.  Is 
this  tradeoff  unavoidable? 


7.2.2.  Future  Work:  Applications 


The  framework  of  secure  distributed  time  is  a  powerful  tool  for  solving  application  problem  that 
depend  on  temporal  relations  more  general  than  real  time.  As  Section  7. 1  discussed,  this  thesis 
applied  this  tool  to  the  problems  of  potential  causality,  distributed  snapshots,  and  optimistic  rollback 
recovery.  However,  the  framework  of  secure  distributed  time  may  be  appropriate  for  many  other 
application  problems.  We  now  consider  some  of  these  problems,  and  discuss  the  possible  relevance 
of  our  framework  and  possible  topics  for  future  research. 


Rollback  Our  research  into  optimistic  rollback  recovery  suggests  many  directions  for  future 
work.  One  direction  is  exploring  the  area  of  commitability,  especially  in  for  situations  where 
rollback  recovery  may  be  initiated  for  reasons  other  than  process  failure.  Another  direction  is 
examining  the  list  in  Section  6.4.2  of  possible  security  attacks  on  optimistic  rollback  recovery 
protocols. 

A  third  direction  consists  of  exploring  more  general  versions  )f  the  problem:  real  world 
applications  provide  motivation  for  rolling  back  rollback.  For  example,  users  of  word  processors 
and  graphics  packages  frequently  attempt  to  UNDO  previous  UNDO  commands.  As  Section  4.4 
discussed,  rolling  back  rollback  requires  considering  general-past  versions  of  the  problem.  It 
would  be  interesting  to  develop  effective  distributed  techniques  for  these  scenarios. 


Distributed  Nested  Transactions  Natural  experience  provides  many  examples  of  atomic 
actions:  consider  Alice  physically  giving  a  ten  dollar  bill  to  Bob.  This  exchange  is  either  success¬ 
fully  completed,  or  it  never  happened.  No  intermediate  views  of  this  action  are  possible. 

Transactions  (e.g.,  [GrRe93])  are  a  standard  tool  for  providing  this  abstraction  in  distributed 
systems.  Without  this  framework,  situations  such  as  process  failure,  unreliable  communication, 
and  interactions  between  concurrent  transactions  may  cause  pathological  behavior.  A  particular 
subcomputation  may  be  distributed  in  time  and  space,  and  consequently  may  be  susceptible  to 
many  failures.  However,  a  transactional  system  guarantees  atomicity,  consistency,  independence, 
and  durability;  a  programmer  may  regard  these  subcomputations  as  durable,  atomic  actions  that 
appear  to  happen  in  some  linear  sequence. 

Essentially,  transactions  perform  temporal  and  computational  abstraction.  During  a  transaction, 
certain  processes  may  perceive  individual  steps  occurring  in  a  certain  order;  everyone  else  must 
perceive  these  actions  as  an  atomic  unit.  Nested  transactions  allow  additional  levels  of  abstraction 
by  permitting  transactions  to  call  lower-level  transactions.  Supporting  nested  transactions  requires 
managing  the  interactions  between  child  and  parent  transactions.  One  aspect  of  this  management  is 
orphan  elimination  (e.g.,  [HLMW87]);  when  a  transaction  is  aborted,  all  subtransactions  executing 
on  its  behalf  must  also  be  aborted. 

Distributed  time  provides  tools  to  support  such  temporal  and  computational  abstraction.  Hierar¬ 
chies  of  time  models  permit  the  multi-level  view  necessary  to  implement  individual  steps  as  part 
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of  a  single  unit,  and  to  support  nested  transactions.  Cyclic  time  models  support  atomicity  in  a 
distributed  environment.  Partial  order  models  support  tracking  dependency,  for  orphan  elimination. 
Consequently,  distributed  time  might  provide  a  nice  framework  to  implement  distributed  nested 
transactions.  As  an  extra  benefit,  we  can  transparently  provide  security  and  privacy  protection 
to  these  implementations.  Fitting  existing  implementation  into  our  framework  and  showing  they 
already  face  these  security  risks  would  be  an  interesting  research  topic. 


Electronic  Cuirrency  Adapting  the  familiar  paradigm  of  physical  cash  to  a  distributed  elec¬ 
tronic  environment  raises  a  number  of  challenges.  Many  properties  of  physical  cash  fail  in  the 
more  general  case  of  electronic  currency. 

As  the  initial  example  for  transactions  showed,  natural  experience  with  cash  implicitly  uses 
transactional  behavior.  For  example,  a  dollar  bill  is  a  unique  physical  token;  a  faulty  physical 
transaction  will  never  cause  this  token  to  be  duplicated.  However,  electronic  transmission  of  a  data 
packet  (in  general)  leaves  the  sender  with  a  copy  of  the  packet.  Further,  electronic  interactions 
may  be  subject  to  network  and  process  failures.  For  example,  if  the  communication  line  breaks 
while  transferring  cash,  what  happens  to  the  cash?  Consequently,  robust  electronic  currency  must 
provide  fully  transactional  behavior;  as  we  have  discussed,  the  abstraction  tools  of  distributed  time 
have  relevance  to  this  area. 

The  temporal  tools  of  distributed  time  also  have  relevance  to  electronic  currency.  For  example, 
bookkeeping  and  auditing  practices  in  the  real  world  are  implicitly  based  on  the  notion  of  per¬ 
ceivable  real-time  simultaneity.  However,  perceivable  simultaneity  is  one  of  the  first  casualties  of 
asynchronous  distribution.  The  framework  of  distributed  time  provides  support  for  obtaining  and 
reasoning  about  possible  simultaneous  states,  and  consequently  provides  support  for  performing 
auditing  and  bookkeeping  along  timeslices. 

Additionally,  the  framework  of  secure  distributed  time  provides  protection  from  acts  of  sabotage 
and  espionage  that  money  seems  to  motivate  in  humans.  If  communication  failures  may  cause 
cash  to  be  created  (or  destroyed),  then  malicious  agents  will  simulate  communication  failures.  If 
incorrect  values  in  timestamp  vectors  prevent  discovery  of  illegal  cash  activities  (as  Section  6.3 
discussed),  then  malicious  agents  will  use  incorrect  values  in  timestamp  vectors. 

The  ability  to  model  multiple  levels  of  abstraction  while  providing  some  assurances  of  security 
and  reliability  provides  the  distributed  time  framework  with  another  class  of  potential  applica¬ 
tions:  balancing  privacy  of  transactions  with  government  law.  For  example,  the  Internal  Revenue 
Service  may  have  the  right  to  examine  and  verify  certain  aspects  of  the  flow  of  electronic  currency. 
Integrating  time  models  expressing  currency  flow  with  time  models  expressing  knowledge  rights 
might  provide  the  necessary  tools. 


Electronic  Exchanges  Performing  commodities  trading  on  public  networks  raises  a  number 
of  challenges  (e.g.,  [SEC94]).  As  the  examples  of  Chapter  5  indicate,  these  applications  are 
susceptible  to  many  explicitly  clock-based  security  problems,  since  adversaries  may  reap  rich 
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rewards  by  subverting  clock  protocols.  How  do  we  accurately  detect  the  order  of  actions?  How 
do  we  prevent  unauthorized  leaking  of  information?  The  framework  of  secure  distributed  time 
provides  a  foundation  for  exploring  these  issues. 

Considering  real-world  scenarios  raises  even  more  questions.  For  example,  how  do  we  enforce 
real-time  fairness?  Even  though  Alice  in  Albuquerque  and  Bad  Bob  in  New  York  do  not  receive 
news  of  a  stock  offer  at  the  same  real  time,  they  should  have  the  same  duration  of  time  to  consider 
the  offer.  If  Alice  responds  one  second  after  she  receives  the  news  and  Bad  Bob  responds  two 
seconds,  Alice’s  response  should  take  precedence,  even  though  it  may  have  been  sent  after  Bob’s 
was  sent.  (The  real-time  order  in  which  the  responses  arrive  is  yet  another  issue.)  These  questions 
suggest  another  interesting  research  topic:  incorporating  real  time  into  the  framework  of  distributed 
time. 


Capabilities  Management  A  capability  is  an  explicit  authorization  granting  its  bearer  certain 
rights.  As  computation  becomes  more  distributed  and  asynchronous,  the  problem  of  capabilities 
management  becomes  more  complex.  If  Alice  performs  a  task  on  behalf  of  Bob,  how  does  Alice 
inherit  the  necessary  capabilities?  How  does  Bob  later  revoke  the  capabilities  he  has  transferred  to 
Alice! 

Capabilities  management  in  distributed  systems  raises  several  issues  related  to  partial  order 
time.  We  list  some  examples: 

•  Managing  capability  inheritance  requires  tracking  a  relation  very  similar  to  precedence  paths. 

•  Revoking  capabilities  requires  modifying  these  paths. 

•  Enforcing  access  rules  requires  using  these  paths  to  restrict  the  user-level  computation. 


The  framework  of  distributed  time  provides  tools  for  constructing  and  and  tracking  hierarchies 
of  partial  order  time  relations.  In  capabilities  management,  these  tools  should  apply  both  to 
the  time-like  relations  of  capabilities,  and  to  the  interaction  of  these  relations  with  time  models 
describing  computation.  In  particular,  our  framework  may  have  relevance  to  the  earlier  examples: 


•  Tracking  which  agents  inherit  which  capabilities  requires  distributed  construction  of  a  di¬ 
rected  acyclic  graph  (e.g.,  [He91]).  We  could  examine  this  problem  in  the  distributed  time 
framework  by  building  time  models  that  express  authority  instead  of  temporal  precedence. 

•  Inheritance  complicates  revocation.  Revoking  a  capability  given  to  Alice  should  also  revoke 
that  capability  from  anyone  who  has  inherited  that  capability  (even  indirectly)  from  Alice. 
However,  this  revocation  should  not  result  in  any  other  capabilities  being  revoked.  If  we 
express  capabilities  using  a  time  model,  then  revocation  reduces  to  optimistic  rollback  re¬ 
covery.  Consequently,  the  framework  of  distributed  time  may  have  some  bearing  on  this 
problem. 
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•  In  order  for  capabilities  to  have  meaning,  their  possession  (or  lack  thereof)  should  affect 
the  underlying  computation.  Having  access  both  to  a  time  model  describing  the  unfolding 
computation  and  also  to  a  time  model  expressing  authority  might  provide  a  way  to  express 
the  semantics  of  capabilities;  having  access  to  clocks  for  these  models  might  provide  a  way 
to  implement  these  semantics. 


Enforcing  access  rights  via  capabilities  is  meaningless  if  malicious  agents  can  subvert  the  under¬ 
lying  management  system.  Using  the  framework  of  distributed  time  has  the  additional  benefit  of 
providing  transparent  security  and  privacy. 

The  framework  of  distributed  time  also  provides  the  potential  for  integrating  capabilities  with 
other  temporal  issues.  For  example,  Herlihy  and  Tygar  [HeTy89]  use  an  approximation  of  real 
time  as  a  basis  for  revoking  capabilities.  Distributed  time  might  provide  a  way  to  generalize  such 
work  based  on  linear  time.  For  example,  how  would  capabilities  be  temporarily  restored  during 
rollback  with  modified  replay? 


Information  Confinement  A  long-considered  issue  (e.g.,  [La73])  in  computer  security  is  the 
confinement  of  information  to  appropriate  agents.  Indeed,  some  researchers  regard  information 
confinement  to  be  synonymous  with  security  (e.g.,  [NCSC90]). 

Tracking  the  flow  of  information  in  order  to  enforce  information  confinement  is  an  area  in 
which  secure  distributed  time  has  particular  relevance.  This  relevance  arises  for  many  of  the 
same  reasons  as  in  the  capabilities  management  problem.  Information  confinement  requires  using 
causality-like  relations  both  to  describe  and  also  to  proscribe  computation,  and  requires  careful 
consideration  of  temporal  abstraction  and  security  issues.  The  framework  of  distributed  time  may 
provide  appropriate  tools: 


•  The  ability  to  express  partial  order  time  allows  the  potential  to  track  correctly  who  has  seen 
what. 

•  The  ability  to  support  flow-virtual  time  allows  us  to  extend  this  potential  for  computations 
whose  history  is  altered. 

•  The  ability  to  support  multiple  levels  of  time  allows  us  to  track  independent  flows. 

•  The  ability  to  support  relations  more  general  than  partial  orders  allows  us  to  consider  the  se¬ 
mantics  of  computation  while  tracking  information  flow.  For  example,  nontransitive  relations 
express  process  actions  that  destroy  data. 


Mobile  Computing  The  advent  of  mobile  computing  raises  a  number  of  challenges  related 
both  to  abstraction  and  to  security.  The  framework  of  distributed  time  might  provide  tools  for  these 
issues. 
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Besides  distribution  and  asynchrony,  mobile  computing  raises  an  additional  level  of  abstraction; 
networks  with  mobile  agents  must  abstract  from  a  dynamic  physical  topology  to  the  more  stable 
logical  topology.  Providing  a  hierarchy  of  time  models  might  express  this  abstraction;  providing 
clocks  for  these  models  might  facilitate  protocol  construction.  The  dynamic  physical  topology 
creates  additional  security  risks:  for  example,  if  a  mechanism  exists  for  Alice  to  appear  suddenly 
in  Cedar  Springs,  Michigan  and  have  her  communication  routed  to  the  server  there,  then  Bad 
Bob  in  New  York  may  abuse  this  mechanism  to  intercept  i4//ce!s  communication.  Expressing  the 
abstraction  using  distributed  time  may  provide  techniques  that  transparently  protect  against  these 
risks. 

Disconnected  operation  also  raises  challenges  with  time  and  with  confinement.  For  example, 
if  a  partition  temporarily  distributes  Alice ’s  computation  among  several  physical  sites,  then  she 
must  re-establish  a  consistent  image  upon  repair  of  the  partition.  We  might  be  able  to  address  this 
problem  using  the  consistent  global  state  tools  of  distributed  time.  Remote  execution  subverts  the 
standard  client/server  model,  since  a  portion  of  the  client’s  computation  may  run  on  the  server’s 
machine,  or  vice-versa.  Sharing  a  machine  creates  a  mutual  suspicion  problem:  one  computation 
might  interfere  with  or  spy  on  the  other.  (For  example,  Trojan  Horses  and  viruses  are  examples  of 
such  attacks.)  The  security  and  privacy  tools  of  distributed  time  might  address  this  problem. 


Hidden  Causality  Real  distributed  systems  frequently  provide  the  potential  for  anonymous  or 
hidden  causality  [Gr75].  The  semaphore  mechanism  is  an  example:  the  agent  granted  a  lock  by 
a  semaphore  knows  neither  who  released  the  lock  nor  who  else  is  waiting  for  it.  Extending  our 
framework  to  handle  these  issues  would  be  an  interesting  research  area.  Suppose  Alice  releases  a 
lock  which  the  semaphore  grants  to  Bob.  Does  Bob  now  depend  on  Alicel  Would  vector  clocks 
leak  the  identity  of  Alice  to  Bob'!  If  Alice  fails  and  rolls  back,  what  should  Bob  do? 


Distributed  Optimistic  Execution  Much  research  (including  the  recent  work  of  Leon  [LFS93] 
and  Cowan  [CLB94])  has  explored  the  uses  of  highly  optimistic  execution  in  distributed  environ¬ 
ments.  For  example,  allowing  long-running  application  programs  to  execute  based  on  speculation 
(and  to  roll  back  if  the  speculation  proves  false)  may  provide  increased  performance  (if  the  specu¬ 
lation  is  correct  sufficiently  often).  The  abstraction  tools  of  distributed  time  framework  may  handle 
this  distribution  and  the  multiple  time  levels  that  may  arise  in  such  implementations;  the  security 
tools  might  provide  transparent  tolerance  against  Byzantine  attacks  on  the  machinery  of  optimism. 


7.2.3.  A  Framework  for  the  Future 

The  abstraction  from  linear  time  to  partial  orders  and  beyond  has  a  precedent  in  the  shift  in  physics 
from  the  classical  world-view  to  the  relativistic  world-view.  The  comfortable,  familiar  perspective 
fails  when  simultaneity  of  information  vanishes.  The  right  perspective  clarities  otherwise  baffling 
behavior  and  also  provides  a  way  to  continue  to  apply  the  comfortable  perspective,  once  formal 
tools  exist  for  changing  frames  of  reference. 
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Linear  time  does  not  adequately  describe  the  behavior  of  distributed  systems.  When  failure 
recovery  is  allowed,  a  single  level  of  partial  order  time  also  does  not  suffice — time  must  have  depth 
as  well  as  width.  Levels  of  temporal  abstraction  will  only  become  more  necessary  as  computational 
environments  become  more  complex,  such  as  by  admitting  mobile  agents,  anonymous  paths  of 
influence,  and  the  potential  for  cross-channel  communication.  Multiple  levels  of  abstraction  will 
multiply  the  problems  of  specifying  and  providing  security  and  privacy  in  protocols  running  in 
such  environments. 

However,  understanding  how  virtual  partial  orders  arise  from  a  hierarchy  of  time  levels  allows  us 
to  model  the  underlying  behavior  of  the  system,  and  to  relativize  the  protocols  and  tools  developed 
for  the  comfortable  world  of  partial  order  time.  Understanding  how  partial  order  protocols  relate 
to  the  underlying  time  models  also  allows  us  to  relativize  the  security  challenges  of  timekeeping. 

This  thesis  provides  the  fundamental  contributions  of  a  framework  to  understand  the  general 
temporal  relations  and  the  concomitant  security  challenges  that  arise  and  will  continue  to  arise  in 
distributed  computation. 
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Glossary 


Terms 


acyclic 

when  a  node  does  not  precede  itself  in  a  graph; 
when  a  time  model  produces  graphs  with  no 
cycles;  when  the  global  model  in  a  parallel  or 
nonlinear  pair  is  acyclic 

15.  18 

adjusted  rollback  vector 

the  vector  obtained  from  the  rollback  vector  for 
a  node  by  moving  the  other  entries  to  their 
maximal  preceding  acyclic  nodes 

33 

adjusted  timestamp  vector 

the  vector  obtained  from  the  timestamp  vector 
for  a  node  by  moving  the  other  entries  to  their 
minimal  following  acyclic  nodes 

33 

antichain 

in  an  order,  a  set  of  mutually  incomparable 
elements 

26 

atom 

a  node  or  edge  in  a  graph 

11 

bit-secure 

cryptographic  functions  that  leak  no  information 

120 

commitable 

a  state  or  event  that  will  never  be  rolled  back 

73 

complete  recoverability 

the  assumption  that  any  state  in  the  live  history 
at  a  non-faulty  process  is  recoverable 

70 

computation  graph 

a  directed  graph  di^scribing  computation 

11 

concurrent 

when  a  time  model  leaves  two  nodes  unordered 

11 

consistent  cut 

a  cut  that  is  also  a  timeslice 

31 

consistent  pair 

a  parallel  or  nonlinear  pair  that  is  view-complete 
and  transitively -bounded 

20 

consistent  set 

a  set  of  user  nodes  whose  live  histories  together 
comprise  a  past-closed  prefix  of  a  graph  from 
FAILURE  -  FREE  _  PARTIAL  _  ORDER 

87 

cut 

a  set  of  nodes,  exactly  one  at  each  process 

31 

digital  signature 

externally  equivalent 
factoring  model 
flow-supported 

flow-virtual 

global  state 

ground-level  computation 
graphs 

independent 

join 
lattice 
live  history 
maximal  node 


a  function  that  only  a  privileged  agent  can  1 20 

perform,  but  anyone  can  check 

when  two  atoms  at  a  process  afford  the  same  20 

external  view 

in  a  nonlinear  or  parallel  pair,  the  model  induced  1 8 
from  the  local  model  to  the  global  model 

when  transitive  precedence  in  a  model  implies  17,18 


information  flow  in  any  underlying  computation; 
when  each  model  in  a  parallel  or  nonlinear  pair 
is  flow-supported 

when  information  flow  does  not  necessarily  17,  1 8 

imply  time  model  precedence;  when  the 
transitive  closure  of  each  model  in  a  parallel  or 
nonlinear  pair  is  flow-virtual 

in  a  ground-level  graph,  a  minimal  set  of  atoms  27 
that  represents  an  instant  of  time  in  an 
underlying  computation 

the  “least  abstract”  computation  graphs,  1 2 

constructed  directly  from  traces 

when  a  parallel  or  nonlinear  pair  has  the  property  20 
that  each  non-extremal  node  in  the  global  model 
represents  a  unique  node  in  the  local  model 

in  lattices,  the  least  upper  bound  of  two  33 

elements;  for  vectors,  the  entry-wise  maximum 

a  nonempty  ordered  set  closed  under  meet  and  33 
join 

a  user  node  along  with  its  past  from  the  78 

USER  _  PARTIAL  _  ORDER 

a  node  with  no  successors  1 1 


meet 

minimal  generating  set 


minimal  node 
modifled  replay 


in  lattices,  the  greatest  lower  bound  two  33 

elements;  for  vectors,  tbe  entry-wise  minimum 

in  a  tiiiieslice  X,  a  minimal  subset  of  nodes  55 

whose  adjusted  timestamp  vectors  join  to  yield 
timeslice  X 

a  node  with  no  predecessors  1 1 

when  the  computation  after  rollback  differs  from  66 
the  original  computation 
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node-monotonic  when  a  time  model  on  ground-level  graphs  has  1 7 

the  property  that  once  it  produces  a  node,  the 
node  never  vanishes 

nonlinear  pair  a  pair  of  related  models  providing  (respectively)  2 1 

system-wide  and  process-only  descriptions  of 
computation;  the  process  descriptions  need  not 
be  linear 

optimistic  rollback  approaches  that  bet  failure  will  not  happen,  and  68 

allow  orphans  to  develop  at  non-faulty  processes 

order  a  relation  that  is  transitive  and  antisymmetric  1 1 

orphan  a  node  that  depends  on  or  equals  a  rolled-back  64 

node 

parallax  when  two  snapshots  of  the  same  computation  58 

could  not  both  have  been  real  simultaneous  states 

parallel  pair  a  pair  of  related  models  providing  (respectively)  1 8 

system-wide  and  process-only  descriptions  of 
computation;  the  process  descriptions  must  be 
linear 

partial  timeslice  a  subset  (not  necessarily  proper)  of  a  timeslice  26 

past-closed  when  nodes  in  a  subgraph  have  the  same  history  1 2 

as  they  do  in  the  original  graph 

past-closure  a  subgraph  minimally  extended  to  make  it  1 2 

past-closed 

pessimistic  rollback  approaches  that  bet  failure  will  happen,  and  68 

prevent  orphans  from  developing  at  non-faulty 
processes 

piecewise  deterministic  when  a  process’s  computation  between  message  72 

receive  events  is  completely  determined  by  the 
state  before  the  first  receive  and  contents  of  the 
message 

prefix  a  subgraph  that  is  connected  and  that  contains  1 2 

the  minimal  nodes 

pseudo- vector  an  array  of  node  sets,  one  for  each  process  80 

public  key  cryptography  an  encryption  function  that  anyone  can  perform,  1 20 

but  only  a  privileged  agent  can  invert 

refinement  time  model  M|  refines  to  time  model  M2  when  15 

Ml  (a)  =  M|(q')  always  implies 
M2(q)  =  M2(a') 
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representation  map 

rollback  vector 
stable 

state  interval 
strongly  edge-monotonic 

strongly  monotonic 

time  model 
timeslice 

timestamp  pseudo- vector 

timestamp  vector 
timetree 

trace 

transitively  bounded 

Type  1 


a  function  (induced  by  the  application  of  a  time 
model  to  a  graph)  that  takes  atoms  in  the  image 
graph  back  to  sets  of  atoms  in  the  original  graph 

13 

the  minimal  nodes  at  each  process  that  follow  or 
equal  a  given  node 

32 

when  a  property  remains  true  once  it  becomes 
true;  when  a  state  or  event  has  been  successfully 
logged  to  stable  storage 

44,  73 

a  sequence  of  states  and  events  representing  a 
period  of  deterministic  execution  at  a  process 

105 

when  a  time  model  on  ground-level  graphs  has 
the  property  that  once  it  produces  two  nodes,  the 
relation  between  those  nodes  is  fixed 

17 

when  a  time  model  is  node-monotonic  and 
strongly  edge-monotonic;  when  the  transitive 
closure  of  each  model  in  a  parallel  or  nonlinear 
pair  is  strongly  monotonic 

17,  18 

a  representative  transformation  of  computation 
graphs 

13 

a  maximal  set  of  mutually  concurrent  (hence 
acyclic)  nodes 

26 

in  a  nonlinear  pair,  the  maximal  nodes  in  the 
local  model  that  precede  or  equal  a  given  node  in 
the  global  model 

80 

the  maximum  node  at  each  process  that  precedes 
or  equals  a  given  node 

32 

the  tree-structure  on  a  process’s  events  that 
emerges  instead  of  timelines  in  the 

USER  _  PARTIAL  _  ORDER 

78 

an  exhaustive  real-time  description  of  a 
computation 

10 

when  the  transitive  closure  of  a  model  produces 
unique  maximal  and  minimal  nodes;  when  the 
global  model  in  a  parallel  or  nonlinear  pair  is 
transitively  bounded 

15,  18 

a  parallel  or  nonlinear  pair  that  is  consistent 

20 

Type  2 


a  parallel  or  nonlinear  pair  that  is  consistent  and  20 
independent 
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l^pe  3  a  parallel  or  nonlinear  pair  that  is  consistent,  20 

independent,  and  strongly  monotonic 

Type  4  a  parallel  or  nonlinear  pair  that  is  consistent,  20 

independent,  strongly-monotonic,  and 
flow-supported 

valid  when  the  live  history  of  a  user  node  is  a  87 

past-closed  prefix  of  a  graph  from  the 
FAILURE  -  FREE  -  PARTIAL  -  ORDER 

vector  an  array  of  nodes,  one  from  each  process  3 1 

view-complete  when  a  graph  from  the  global  model  in  a  pair  has  20 


the  property  that  for  any  edge  at  process,  there 
exists  a  node  that  is  externally  equivalent  in  the 
transitive  global  graph;  when  a  parallel  or 
nonlinear  pair  always  produces  view-complete 
graphs 

weakly  edge-monotonic  when  a  time  model  on  ground-level  graphs  has  1 7 

the  property  that  once  it  produces  an  edge,  the 
edge  never  vanishes 

weakly  monotonic  when  a  time  model  is  node-monotonic  and  17,18 

weakly  edge-monotonic;  when  the  transitive 
closure  of  each  model  in  a  parallel  or  nonlinear 
pair  is  weakly  monotonic 


Clock  Primitives 


ACYCUC 

clock  primitive  testing  if  a  node  is  acyclic 

35 

COMPARE 

clock  primitive  comparing  two  vectors 

36 

CONCURRENT 

clock  primitive  testing  if  two  nodes  are 
concurrent 

34 

CUR. GRAPH 

universal  variable  for  current  ground-level  graph 

34 

CUR.NODE 

clock  primitive  returning  current  node 

36 

UST 

clock  meta-primitive,  listing  nodes  from  a 
specified  graph  with  a  specified  property 

36 

UST. CONCURRENT 

clock  primitive  listing  all  nodes  at  a  process 
concurrent  with  a  given  node 

37 
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MAX 

clock  primitive  returning  the  entry-wise 
maximum  of  two  vectors 

36 

NEXT 

clock  primitive  returning  the  node  following  a 
given  node  at  a  process 

37 

NODE 

clock  meta-primitive  returning  the  unique  node 
from  a  specified  graph  with  a  specified  property 

36 

PRECEDES 

clock  primitive  testing  if  one  node  precedes 
another 

34 

PREVIOUS 

clock  primitive  returning  the  node  preceding  a 
given  node  at  a  process 

37 

SEND.  EVENT 

clock  primitive  returning  the  send  event  of  a 
given  message 

82 

SYSTEM 

clock  primitive  mapping  a  user  node  to  the  set  of 
system  nodes  it  represents 

81 

USER 

clock  primitive  mapping  a  system  node  to  its 
user  node 

81 

USER.MESSAGE 

clock  primitive  extracting  the  user  message  from 
a  system  message  carrying  one 

82 

USER. MESSAGE.  TEST 

clock  primitive  testing  if  a  system  message 
carries  a  user  message 

82 

USER. VECTOR 

clock  primitive  mapping  a  vector  of  system 
nodes  to  a  vector  of  user  nodes 

82 

Time  Models 


BLCXTKED 

time  model  expressing  when  the  presence  of  one 
node  in  a  minimal  generating  set  for  a  timeslice 
blocks  the  presence  of  another 

55 

CLOCK  _  PARTIAL  _  ORDER 

partial  order  time  model  expressing  the 
experience  of  clock  agents 

131 

CLOCK -TIMELINES 

timelines  time  model  expressing  the  experience 
of  clock  agents 

131 

FAILURE  -  FREE  _  PARTIAL  _  ORDER 

partial  order  time  model  defined  only  for  traces 
of  executions  of  failure-free  implemented 

83 

processes 

162 


IMPLEMENT  time  model  expressing  how  to  construct  the  77 

USER.  PARTIAL -ORDER  from  the 
SYSTEM  -  PARTIAL  .  ORDER 

NET .  ABSTRACT  time  model  abstracting  away  network  activity  2 1 

PARTIAL  -ORDER. TIME  time  model  organizing  process  activity  into  a  25 

partial  order 

STRONG -PARTIAL -ORDER  a  “partial  order”  time  model  with  bidirectional  S3 

message  edges 

STRONG  time  model  making  cross-process  edges  S3 

bidirectional 

SYSTEM  -  PARTIAL  -  ORDER  partial  order  time  model  for  the  recovery  76 

computation 

SYSTEM -TIMELINES  timelines  time  model  for  the  recovery  76 

computation 

TIMELINES  time  model  organizing  process  activity  into  22 

timelines 

TIMETREES  “timelines”  time  model  expressing  logical  local  78 

precedence  for  user  computation — hence 
process  structure  is  a  tree,  not  a  line 

TRANS  time  model  performing  transitive  closure  1 8 

USER  -  PARTIAL  .  ORDER  partial  order  time  model  that  examines  only  the  78 

state  of  the  implemented  process,  and  expresses 
logical  precedence 

Symbols 

A  — >  B  node  A  precedes  node  B  1 1 

A  B  node  A  precedes  or  equals  node  B  1 1 

A  <-/->  B  nodes  A  and  B  are  incomparable  1 1 

Ml  >  M2  model  Mi  refines  to  model  M2  15 

( M,  Q )  the  representation  map  induced  by  applying  1 3 

model  M  to  graph  o 

M  the  transitive  closure  of  model  M  1 8 
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(«.] 

a  sequence  of  unfolding  ground-level  graphs 
representing  a  computation  in  progress 

17 

the  set  of  ground-level  gr^h  sequences  that,  at 
some  point,  generate  0  through  M 

17 

(M,M') 

a  parallel  or  nonlinear  pair;  M  is  the  global 
system  model  and  M'  is  the  local  process  model 

18,21 

M/M' 

the  factoring  model  for  pair  (M,  M') 

18 

TTpA- 

the  process  p  entry  of  X 

31 

R{A) 

the  rollback  vector  of  A 

32 

R-{A) 

the  adjusted  rollback  vector  of  A 

33 

\{A) 

the  timestamp  vector  of  A 

32 

V(A) 

the  adjusted  timestamp  vector  of  A 

33 

V'(A) 

the  timestamp  pseudo- vector  of  A 

80 

X 

timeslice  X  precedes  timeslice  Y 

32 

xnY 

the  meet  of  X  and  Y 

33 

XUY 

the  join  of  X  and  Y 

33 
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